From d1396ad0ecbb4f04537cc4e9a0d681132b243170 Mon Sep 17 00:00:00 2001 From: wendlingd Date: Mon, 10 Sep 2018 11:03:06 -0400 Subject: [PATCH] Scripts first post --- 01_Text_wrangling.ipynb | 3274 ++++++++++++++++++++++ 02_Run_APIs.ipynb | 1077 +++++++ 02_Run_APIs.py | 983 +++++++ 03_Fuzzy_match.ipynb | 530 ++++ 04_Machine_learning_classification.ipynb | 906 ++++++ 05_Chart_the_trends.ipynb | 293 ++ 05b_Chart_the_trends-BiggestMovers.ipynb | 578 ++++ 06_Load_database.ipynb | 316 +++ 07_UI_building.ipynb | 135 + 08_Misc_fixes.ipynb | 306 ++ 10 files changed, 8398 insertions(+) create mode 100644 01_Text_wrangling.ipynb create mode 100644 02_Run_APIs.ipynb create mode 100644 02_Run_APIs.py create mode 100644 03_Fuzzy_match.ipynb create mode 100644 04_Machine_learning_classification.ipynb create mode 100644 05_Chart_the_trends.ipynb create mode 100644 05b_Chart_the_trends-BiggestMovers.ipynb create mode 100644 06_Load_database.ipynb create mode 100644 07_UI_building.ipynb create mode 100644 08_Misc_fixes.ipynb diff --git a/01_Text_wrangling.ipynb b/01_Text_wrangling.ipynb new file mode 100644 index 0000000..d6f2a9c --- /dev/null +++ b/01_Text_wrangling.ipynb @@ -0,0 +1,3274 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 1. Text wrangling\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** Resolve text formatting/syntax problems and match against historical file
\n", + "Authors: dan.wendling@nih.gov,
\n",
    "Last modified: 2018-09-09\n",
    "\n",
    "\n",
    "## Script contents\n",
    "\n",
    "1. Start-up / What to put into place, where\n",
    "2. Unite search log data in single dataframe; globally update columns and rows\n",
    "3. Separate out the queries with non-English characters\n",
    "4. Run baseline dataset stats\n",
    "5. Clean up content to improve matching\n",
    "6. Make special-case assignments with F&R, RegEx: Bibliographic, Numeric, Named entities\n",
    "7. Create logAfterGoldStandard - Match to the \"gold standard\" file of historical matches\n",
    "8. Create 'uniques' dataframe/file for APIs\n",
    "\n",
    "\n",
    "## FIXMEs\n",
    "\n",
    "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n",
    "* [ ] Update from 1:1 capture to 1:n capture\n",
    "* [ ] Add two more runs against the UMLS Metathesaurus API:\n",
    "    - Isolate non-English terms and remove them from percent-complete calcs.\n",
    "      Add separate statistics for non-English terms.\n",
    "    - Run remaining terms with \"word\" matching or \"approximate\" matching;\n",
    "      compare those suggestions to ML suggestions. Create a df with one \n",
    "      column for each suggestion source: Metathesaurus-Approximate, \n",
    "      LinearSVC, LogisticRegression...\n",
    "* [ ] Add summary visualizations / data quality dashboard\n",
    "* [ ] Update Cognos search log reports:\n",
    "** [ ] Change col names to one word: 'Search Timestamp': 'Timestamp', \n",
    "    'NLM IP Y/N':'StaffYN', 'IP':'SessionID'\n",
    "** [ ] Make it UTF-8-enough for Python \n",
    "** [ ] Remove 8 col of blank cells\n",
    "** [ ] Standardize Timestamp syntax b/w CSV and Excel formats\n",
    "** [ ] I could isolate acronyms more easily if queries with unaltered case were \n",
    "    available. I can lower-case things as needed. Reasons for receiving in lc? \n",
    "    Any reasons to keep it the way it is?\n",
    "* [ ] Continue changing processing order. <br/>
Perhaps avoid all the extra Semantic\n",
    "Network assignments until very end of scripts 1 and 2.\n",
    "\n",
    "\n",
    "## Cheat sheets for markdown text\n",
    "\n",
    "* https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html\n",
    "* https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed\n",
    "* https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/\n",
    "\n",
    "\n",
    "## 1. Start-up / What to put into place, where\n",
    "\n",
    "Search log from internal site search. This script assumes an Excel file \n",
    "whose top two rows should be ignored, with these columns:\n",
    "\n",
    "| ID | IP | NLM IP Y/N | Referrer | Query | Search Timestamp |\n",
    "\n",
    "ID - Unique row ID\n",
    "IP - Unique, anonymized session ID\n",
    "NLM IP Y/N - Whether the query was from the NLM LAN, Y or N\n",
    "Referrer - Where the visitor was when the search was submitted\n",
    "Query - The query content\n",
    "Search Timestamp - When the query was run\n",
    "\n",
    "Required for this script: Referrer, Query, Search Timestamp. I use Excel \n",
    "because my source info system breaks CSV files when the query has commas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.pyplot import pie, axis, show\n",
    "import numpy as np\n",
    "import os\n",
    "import string\n",
    "\n",
    "# Set working directory\n",
    "os.chdir('/Users/wendlingd/_webDS')\n",
    "\n",
    "localDir = '01_Text_wrangling_files/'\n",
    "\n",
    "'''\n",
    "Before running script, copy the following new files to /00 SourceFiles/; \n",
    "adjust names below, as needed. <br/>
Make them THE SAME TIME PERIOD - one month,\n", + "one quarter, one year, whatever.\n", + "'''\n", + "\n", + "# What is your new log file named?\n", + "newSearchLogFile = '00_Source_files/week31.xlsx'\n", + "\n", + "# Bring in historical file of (somewhat edited) matches\n", + "GoldStandard = localDir + 'GoldStandard_Master.xlsx'\n", + "GoldStandard = pd.read_excel(GoldStandard)\n", + "\n", + "'''\n", + "SemanticNetworkReference - Used in progress charts\n", + "\n", + "It's a customized version of the list at \n", + "https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html, \n", + "to be used to put search terms into huge bins. Can be integrated into \n", + "GoldStandard and be available if we want to see the progress of assignments\n", + "through the process.\n", + "'''\n", + "SemanticNetworkReference = localDir + 'SemanticNetworkReference.xlsx'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Unite search log data into single dataframe; globally update columns and rows\n", + "\n", + "If csv and Tab delimited, for example: pd.read_csv(filename, sep='\\t')\n", + "searchLog = pd.read_csv(newSearchLogFile)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 20639 entries, 0 to 20638\n", + "Data columns (total 6 columns):\n", + "ID 20639 non-null object\n", + "IP 20638 non-null object\n", + "NLM IP Y/N 20639 non-null object\n", + "Referrer 20638 non-null object\n", + "Query 20639 non-null object\n", + "Search Timestamp 20638 non-null datetime64[ns]\n", + "dtypes: datetime64[ns](1), object(5)\n", + "memory usage: 967.5+ KB\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SessionIDStaffYNReferrerQueryTimestampadjustedQueryCase
0FCB8C84AEDB5855CDDB2F29E38C8C8D1Nwww.nlm.nih.gov/lichen ruber mucosae2018-07-30 07:48:01.000lichen ruber mucosae
1D052BA917FD4489BD63014BE6568670ENwww.nlm.nih.gov/molecular identification of marine bacteria2018-07-30 01:14:26.000molecular identification of marine bacteria
2D052BA917FD4489BD63014BE6568670ENvsearch.nlm.nih.gov/vivisimo/cgi-bin/query-met...molecular identification of marine fishes bact...2018-07-30 01:23:06.000molecular identification of marine fishes bact...
347C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/serfile_addedinfo.htmlsecondaries brain prognostic factors2018-07-30 02:18:34.999secondaries brain prognostic factors
4993C3E958AB335FC500CB6BA0C03CBD8Nvsearch.nlm.nih.gov/vivisimo/cgi-bin/query-met...smoking&alzheimer's disease2018-07-30 02:26:16.999smoking&alzheimer's disease
\n", + "
" + ], + "text/plain": [ + " SessionID StaffYN \\\n", + "0 FCB8C84AEDB5855CDDB2F29E38C8C8D1 N \n", + "1 D052BA917FD4489BD63014BE6568670E N \n", + "2 D052BA917FD4489BD63014BE6568670E N \n", + "3 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "4 993C3E958AB335FC500CB6BA0C03CBD8 N \n", + "\n", + " Referrer \\\n", + "0 www.nlm.nih.gov/ \n", + "1 www.nlm.nih.gov/ \n", + "2 vsearch.nlm.nih.gov/vivisimo/cgi-bin/query-met... \n", + "3 www.nlm.nih.gov/bsd/serfile_addedinfo.html \n", + "4 vsearch.nlm.nih.gov/vivisimo/cgi-bin/query-met... \n", + "\n", + " Query Timestamp \\\n", + "0 lichen ruber mucosae 2018-07-30 07:48:01.000 \n", + "1 molecular identification of marine bacteria 2018-07-30 01:14:26.000 \n", + "2 molecular identification of marine fishes bact... 2018-07-30 01:23:06.000 \n", + "3 secondaries brain prognostic factors 2018-07-30 02:18:34.999 \n", + "4 smoking&alzheimer's disease 2018-07-30 02:26:16.999 \n", + "\n", + " adjustedQueryCase \n", + "0 lichen ruber mucosae \n", + "1 molecular identification of marine bacteria \n", + "2 molecular identification of marine fishes bact... \n", + "3 secondaries brain prognostic factors \n", + "4 smoking&alzheimer's disease " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Search log from Excel; IBM Cognos starts a new worksheet every 65k rows;\n", + "# open the file to see how many worksheets you need to bring in. 
FY18 q3 had 6 worksheets\n",
    "searchLog = pd.read_excel(newSearchLogFile, skiprows=2)\n",
    "'''\n",
    "x1 = searchLog  # first worksheet, read above\n",
    "x2 = pd.read_excel(newSearchLogFile, 'Page1_2', skiprows=2)\n",
    "x3 = pd.read_excel(newSearchLogFile, 'Page1_3', skiprows=2)\n",
    "x4 = pd.read_excel(newSearchLogFile, 'Page1_4', skiprows=2)\n",
    "x5 = pd.read_excel(newSearchLogFile, 'Page1_5', skiprows=2)\n",
    "x6 = pd.read_excel(newSearchLogFile, 'Page1_6', skiprows=2)\n",
    "# x5 = pd.read_excel('00 SourceFiles/2018-06/Queries-2018-05.xlsx', 'Page1_2', skiprows=2)\n",
    "\n",
    "searchLog = pd.concat([x1, x2, x3, x4, x5, x6], ignore_index=True)\n",
    "'''\n",
    "\n",
    "searchLog.head(n=5)\n",
    "searchLog.shape\n",
    "searchLog.info()\n",
    "searchLog.columns\n",
    "\n",
    "# Drop ID column, not needed\n",
    "searchLog.drop(['ID'], axis=1, inplace=True)\n",
    "\n",
    "# Until the Cognos report is fixed (blank columns, multi-word col names),\n",
    "# update col names\n",
    "searchLog = searchLog.rename(columns={'Search Timestamp': 'Timestamp', \n",
    "                                      'NLM IP Y/N':'StaffYN',\n",
    "                                      'IP':'SessionID'})\n",
    "\n",
    "# Remove https:// to become joinable with traffic data\n",
    "searchLog['Referrer'] = searchLog['Referrer'].str.replace('https://', '')\n",
    "\n",
    "# Dupe off the Query column into a lower-cased 'adjustedQueryCase', which \n",
    "# will be the column you match against\n",
    "searchLog['adjustedQueryCase'] = searchLog['Query'].str.lower()\n",
    "\n",
    "# Remove incomplete rows, which can cause errors later\n",
    "searchLog = searchLog[~pd.isnull(searchLog['Referrer'])]\n",
    "searchLog = searchLog[~pd.isnull(searchLog['Query'])]\n",
    "searchLog.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"\\n# Start new df so you can revert if needed\\nsearchLogClean = nonForeign\\n\\n\\n\\n# When restarting work or recovering from error later, use cleaned log from file\\n# <br/>
newSearchLogFile = '00 SourceFiles/2018-04/q2_2018-en-us.xlsx'\\n# searchLogClean = pd.read_excel(localDir + 'searchLogClean.xlsx')\\n\\n# Remove showForeign, nonForeign, searchLog\\n\"" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# *** STILL NEEDED?? - COMMENTED OUT TO AVOID DAMAGING IN AUTO-RUN ***\n", + "# Eyeball df and remove (some) foreign-language entries APIs can't match - non-Roman's (??)\n", + "\n", + "'''\n", + "showForeign = searchLog.sort_values(by='adjustedQueryCase', ascending=False)\n", + "showForeign = showForeign.reset_index()\n", + "showForeign.drop(['index'], axis=1, inplace=True)\n", + "\n", + "nonForeign = showForeign[330:] # Eyeball, update to remove down to the rows the APIs will be able to parse\n", + "\n", + "# Eyeball, sorting by adjustedQueryCase, remove specific useless rows as needed\n", + "nonForeign.drop(41402, inplace=True)\n", + "nonForeign.drop(41401, inplace=True)\n", + "nonForeign.drop(19657, inplace=True)\n", + "nonForeign.drop(19656, inplace=True)\n", + "nonForeign.drop(19655, inplace=True)\n", + "nonForeign.drop(19654, inplace=True)\n", + "nonForeign.drop(19646, inplace=True)\n", + "nonForeign.drop(19647, inplace=True)\n", + "\n", + "# Space clean-up as needed\n", + "nonForeign['adjustedQueryCase'] = nonForeign['adjustedQueryCase'].str.replace(' ', ' ') # two spaces to one\n", + "nonForeign['adjustedQueryCase'] = nonForeign['adjustedQueryCase'].str.strip() # remove leading and trailing spaces\n", + "nonForeign = nonForeign.loc[(nonForeign['adjustedQueryCase'] != \"\")]\n", + "'''\n", + "\n", + "'''\n", + "# Start new df so you can revert if needed\n", + "searchLogClean = nonForeign\n", + "\n", + "# Remove showForeign, nonForeign, searchLog\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# When restarting work or recovering from error later, use cleaned log from file\n", + "# 
newSearchLogFile = '00 SourceFiles/2018-04/q2_2018-en-us.xlsx'\n",
    "# searchLogClean = pd.read_excel(localDir + 'searchLogClean.xlsx')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Run baseline dataset stats\n",
    "\n",
    "Not the purpose of this project, but other staff have use for... Before we\n",
    "cut down the log content, some quick calculations.\n",
    "\n",
    "Future: Overall percentage of hit-and-runs, 'one and done'\n",
    "Group and count by session ID - Create table of counts\n",
    "(ID was dropped above, so group on SessionID)\n",
    "\n",
    "numberOfSearches = searchLog.groupby(['SessionID']).size()\n",
    "numberOfSearches = numberOfSearches.reset_index()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total searches in raw log file: 20638\n",
      "Total SEARCHES, on NLM LAN or not\n",
      "N    20432\n",
      "Y      206\n",
      "Name: StaffYN, dtype: int64\n",
      "Total SESSIONS, on NLM LAN or not\n",
      "StaffYN\n",
      "N    7993\n",
      "Y      41\n",
      "Name: SessionID, dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\"\\n# How to set a date range\\nAprMay = logAfterUmlsApi1[(logAfterUmlsApi1['Timestamp'] > '2018-04-01 01:00:00') & (logAfterUmlsApi1['Timestamp'] < '2018-06-01 00:00:00')]\\n\""
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Total searches in raw log file: {}\".format(len(searchLog)))\n",
    "\n",
    "# tot\n",
    "print(\"Total SEARCHES, on NLM LAN or not\\n{}\".format(searchLog['StaffYN'].value_counts()))\n",
    "\n",
    "print(\"Total SESSIONS, on NLM LAN or not\\n{}\".format(searchLog.groupby('StaffYN')['SessionID'].nunique()))\n",
    "\n",
    "# If you see digits in text col, perhaps these are partial log entries - eyeball for removal\n",
    "# searchLog.drop(76080, inplace=True)\n",
    "\n",
    "\n",
    "# Total SEARCHES containing 'Non-English characters'\n",
    "# <br/>
print(\"Total SEARCHES with non-English characters\\n{}\".format(searchLog['preferredTerm'].value_counts()))\n", + "\n", + "# Total SESSIONS containing 'Non-English characters'\n", + "# Future\n", + "\n", + "'''\n", + "# How to set a date range\n", + "AprMay = logAfterUmlsApi1[(logAfterUmlsApi1['Timestamp'] > '2018-04-01 01:00:00') & (logAfterUmlsApi1['Timestamp'] < '2018-06-01 00:00:00')]\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Top staff queries as enteredCount
012347
1nnlm4
2dennis benson4
3errata3
4medline2
5itas2
6lister hill auditorium2
7nlm logo2
8strategic plan2
9sheridan2
10staff library2
11nichsr2
12turning the pages2
13sis2
14https://www.nlm.nih.gov/services/nlmchat.html2
15digital collections2
16locatorplus2
17urgoclean: the evidence base1
18mesh1
19nlm service and hours1
\n", + "
" + ], + "text/plain": [ + " Top staff queries as entered Count\n", + "0 1234 7\n", + "1 nnlm 4\n", + "2 dennis benson 4\n", + "3 errata 3\n", + "4 medline 2\n", + "5 itas 2\n", + "6 lister hill auditorium 2\n", + "7 nlm logo 2\n", + "8 strategic plan 2\n", + "9 sheridan 2\n", + "10 staff library 2\n", + "11 nichsr 2\n", + "12 turning the pages 2\n", + "13 sis 2\n", + "14 https://www.nlm.nih.gov/services/nlmchat.html 2\n", + "15 digital collections 2\n", + "16 locatorplus 2\n", + "17 urgoclean: the evidence base 1\n", + "18 mesh 1\n", + "19 nlm service and hours 1" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top queries from LAN (not normalized)\n", + "searchLogLanYes = searchLog.loc[searchLog['StaffYN'].str.contains('Y') == True]\n", + "searchLogLanYesQueryCounts = searchLogLanYes['Query'].value_counts()\n", + "searchLogLanYesQueryCounts = searchLogLanYesQueryCounts.reset_index()\n", + "searchLogLanYesQueryCounts = searchLogLanYesQueryCounts.rename(columns={'index': 'Top staff queries as entered', 'Query': 'Count'})\n", + "searchLogLanYesQueryCounts.head(n=20)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Top queries from NLM LAN, from Home, as enteredCount
012347
1dennis benson4
2errata3
3sis2
4lister hill auditorium2
5strategic plan2
6turning the pages2
7nichsr2
8sheridan2
9nlm logo2
10medline2
11digital collections2
12locatorplus2
13itas2
14urgoclean: the evidence base1
15news and events1
16nlm service and hours1
17indexing in medline1
18mesh1
19cords1
20dreger1
21drug information1
22oid1
23phd1
24hsrproj1
\n", + "
" + ], + "text/plain": [ + " Top queries from NLM LAN, from Home, as entered Count\n", + "0 1234 7\n", + "1 dennis benson 4\n", + "2 errata 3\n", + "3 sis 2\n", + "4 lister hill auditorium 2\n", + "5 strategic plan 2\n", + "6 turning the pages 2\n", + "7 nichsr 2\n", + "8 sheridan 2\n", + "9 nlm logo 2\n", + "10 medline 2\n", + "11 digital collections 2\n", + "12 locatorplus 2\n", + "13 itas 2\n", + "14 urgoclean: the evidence base 1\n", + "15 news and events 1\n", + "16 nlm service and hours 1\n", + "17 indexing in medline 1\n", + "18 mesh 1\n", + "19 cords 1\n", + "20 dreger 1\n", + "21 drug information 1\n", + "22 oid 1\n", + "23 phd 1\n", + "24 hsrproj 1" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top queries from NLM LAN, from NLM Home (not normalized)\n", + "searchLogLanYesHmPg = searchLog.loc[searchLog['StaffYN'].str.contains('Y') == True]\n", + "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n", + "searchLogLanYesHmPg = searchLogLanYesHmPg[searchLogLanYesHmPg.Referrer.str.contains('|'.join(searchfor))]\n", + "searchLogLanYesHmPgQueryCounts = searchLogLanYesHmPg['Query'].value_counts()\n", + "searchLogLanYesHmPgQueryCounts = searchLogLanYesHmPgQueryCounts.reset_index()\n", + "searchLogLanYesHmPgQueryCounts = searchLogLanYesHmPgQueryCounts.rename(columns={'index': 'Top queries from NLM LAN, from Home, as entered', 'Query': 'Count'})\n", + "searchLogLanYesHmPgQueryCounts.head(n=25)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Top queries off of LAN, as enteredCount
0search49
1diabetes41
2endnote24
3index medicus24
4mesh23
5cancer22
6pubmed21
7calcium channel blockers20
8international journal of scientific research15
9metabolic syndrome14
10rotator cuff injuries13
11stroke13
12keywords12
13nursing12
14breast cancer11
15suicide11
16depression11
17tuberculosis10
18heart10
19rxnorm10
20egg10
21adhd9
22icdk9 c18 acn乙腈洗脱9
23vancouver9
24an attempt at isolation and characterization o...9
\n", + "
" + ], + "text/plain": [ + " Top queries off of LAN, as entered Count\n", + "0 search 49\n", + "1 diabetes 41\n", + "2 endnote 24\n", + "3 index medicus 24\n", + "4 mesh 23\n", + "5 cancer 22\n", + "6 pubmed 21\n", + "7 calcium channel blockers 20\n", + "8 international journal of scientific research 15\n", + "9 metabolic syndrome 14\n", + "10 rotator cuff injuries 13\n", + "11 stroke 13\n", + "12 keywords 12\n", + "13 nursing 12\n", + "14 breast cancer 11\n", + "15 suicide 11\n", + "16 depression 11\n", + "17 tuberculosis 10\n", + "18 heart 10\n", + "19 rxnorm 10\n", + "20 egg 10\n", + "21 adhd 9\n", + "22 icdk9 c18 acn乙腈洗脱 9\n", + "23 vancouver 9\n", + "24 an attempt at isolation and characterization o... 9" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top queries outside NLM LAN (not normalized)\n", + "searchLogLanNo = searchLog.loc[searchLog['StaffYN'].str.contains('N') == True]\n", + "searchLogLanNoQueryCounts = searchLogLanNo['Query'].value_counts()\n", + "searchLogLanNoQueryCounts = searchLogLanNoQueryCounts.reset_index()\n", + "searchLogLanNoQueryCounts = searchLogLanNoQueryCounts.rename(columns={'index': 'Top queries off of LAN, as entered', 'Query': 'Count'})\n", + "searchLogLanNoQueryCounts.head(n=25)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Top queries off of LAN, from Home, as enteredCount
0calcium channel blockers20
1diabetes20
2index medicus16
3mesh10
4stevia8
5heart8
6mrsa7
7bubonic plague7
8xanax7
9stroke6
10sunscreen6
11rxnorm6
12tuberculosis6
13nutrition6
14data, tools, and statistics6
15immunocytochemical study of human lymphoid tis...6
16fibromyalgia6
17foreign matter enters the eye6
18hemohim5
19teicoplanin5
20keywords5
21prevention techniques5
22pillbox5
23testosterone5
24depression5
\n", + "
" + ], + "text/plain": [ + " Top queries off of LAN, from Home, as entered Count\n", + "0 calcium channel blockers 20\n", + "1 diabetes 20\n", + "2 index medicus 16\n", + "3 mesh 10\n", + "4 stevia 8\n", + "5 heart 8\n", + "6 mrsa 7\n", + "7 bubonic plague 7\n", + "8 xanax 7\n", + "9 stroke 6\n", + "10 sunscreen 6\n", + "11 rxnorm 6\n", + "12 tuberculosis 6\n", + "13 nutrition 6\n", + "14 data, tools, and statistics 6\n", + "15 immunocytochemical study of human lymphoid tis... 6\n", + "16 fibromyalgia 6\n", + "17 foreign matter enters the eye 6\n", + "18 hemohim 5\n", + "19 teicoplanin 5\n", + "20 keywords 5\n", + "21 prevention techniques 5\n", + "22 pillbox 5\n", + "23 testosterone 5\n", + "24 depression 5" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top queries outside NLM LAN, from NLM Home (not normalized)\n", + "searchLogLanNoHmPg = searchLog.loc[searchLog['StaffYN'].str.contains('N') == True]\n", + "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n", + "searchLogLanNoHmPg = searchLogLanNoHmPg[searchLogLanNoHmPg.Referrer.str.contains('|'.join(searchfor))]\n", + "searchLogLanNoHmPgQueryCounts = searchLogLanNoHmPg['Query'].value_counts()\n", + "searchLogLanNoHmPgQueryCounts = searchLogLanNoHmPgQueryCounts.reset_index()\n", + "searchLogLanNoHmPgQueryCounts = searchLogLanNoHmPgQueryCounts.rename(columns={'index': 'Top queries off of LAN, from Home page, as entered', 'Query': 'Count'})\n", + "searchLogLanNoHmPgQueryCounts.head(n=25)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Top home page queries, staff or public, as enteredCount
0diabetes20
1calcium channel blockers20
2index medicus16
3mesh11
4heart9
5stevia8
612347
7xanax7
8mrsa7
9tuberculosis7
10bubonic plague7
11immunocytochemical study of human lymphoid tis...6
12sunscreen6
13nutrition6
14foreign matter enters the eye6
15rxnorm6
16fibromyalgia6
17data, tools, and statistics6
18stroke6
19pillbox5
20hemohim5
21depression5
22magnet hospitals instiutions of excellence5
23keywords5
24journal manuscript guidelines5
\n", + "
" + ], + "text/plain": [ + " Top home page queries, staff or public, as entered Count\n", + "0 diabetes 20\n", + "1 calcium channel blockers 20\n", + "2 index medicus 16\n", + "3 mesh 11\n", + "4 heart 9\n", + "5 stevia 8\n", + "6 1234 7\n", + "7 xanax 7\n", + "8 mrsa 7\n", + "9 tuberculosis 7\n", + "10 bubonic plague 7\n", + "11 immunocytochemical study of human lymphoid tis... 6\n", + "12 sunscreen 6\n", + "13 nutrition 6\n", + "14 foreign matter enters the eye 6\n", + "15 rxnorm 6\n", + "16 fibromyalgia 6\n", + "17 data, tools, and statistics 6\n", + "18 stroke 6\n", + "19 pillbox 5\n", + "20 hemohim 5\n", + "21 depression 5\n", + "22 magnet hospitals instiutions of excellence 5\n", + "23 keywords 5\n", + "24 journal manuscript guidelines 5" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Top home page queries, staff or public\n", + "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n", + "searchLogAllHmPgQueryCounts = searchLog[searchLog.Referrer.str.contains('|'.join(searchfor))]\n", + "searchLogAllHmPgQueryCounts = searchLogAllHmPgQueryCounts['Query'].value_counts()\n", + "searchLogAllHmPgQueryCounts = searchLogAllHmPgQueryCounts.reset_index()\n", + "searchLogAllHmPgQueryCounts = searchLogAllHmPgQueryCounts.rename(columns={'index': 'Top home page queries, staff or public, as entered', 'Query': 'Count'})\n", + "searchLogAllHmPgQueryCounts.head(n=25)\n", + "\n", + "# Add table, Percentage of staff, public searches done within pages, within search results\n", + "\n", + "# Add table for Top queries with columns/counts On LAN, Off LAN, Total\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"\\nRemove manually for now.\\nNot finding an equiv to R's rm; cf https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram?rq=1\\npd.x1(), pd.x2(), # pd.x3(), 
pd.x4(), pd.x5(), pd.x6(), pd.x7(), \n pd.searchLogLanYes(), pd.searchLogLanYesHmPg(), \n pd.searchLogLanNo(), pd.searchLogLanNoHmPg(),\n pd.searchLogAllHmPg()\n""
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Remove the searches run from within search results screens, vsearch.nlm.nih.gov/vivisimo/\n",
+ "# I'm not looking at these now; you might be.\n",
+ "searchLog = searchLog[searchLog.Referrer.str.startswith(\"www.nlm.nih.gov\") == True]\n",
+ "\n",
+ "# Not sure what these are, www.nlm.nih.gov/?_ga=2.95055260.1623044406.1513044719-1901803437.1513044719\n",
+ "searchLog = searchLog[searchLog.Referrer.str.startswith(\"www.nlm.nih.gov/?_ga=\") == False]\n",
+ "\n",
+ "# FIXME - VARIABLE EXPLORER: After saving the stats, remove unneeded 'Type=DataFrame' items\n",
+ "'''\n",
+ "Remove manually for now, or use del - Python's equivalent of R's rm - then gc.collect(); cf https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram?rq=1\n",
+ "del x1, x2, x3, x4, x5, x6, x7\n",
+ "del searchLogLanYes, searchLogLanYesHmPg\n",
+ "del searchLogLanNo, searchLogLanNoHmPg\n",
+ "del searchLogAllHmPg\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SessionIDStaffYNReferrerQueryTimestampadjustedQueryCase
0FCB8C84AEDB5855CDDB2F29E38C8C8D1Nwww.nlm.nih.gov/lichen ruber mucosae2018-07-30 07:48:01.000lichen ruber mucosae
1D052BA917FD4489BD63014BE6568670ENwww.nlm.nih.gov/molecular identification of marine bacteria2018-07-30 01:14:26.000molecular identification of marine bacteria
347C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/serfile_addedinfo.htmlsecondaries brain prognostic factors2018-07-30 02:18:34.999secondaries brain prognostic factors
60D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasites
85EC71AEB4FDE600004405F91FD0F0379Nwww.nlm.nih.gov/bsd/pmresources.htmlvojta2018-07-30 06:19:02.000vojta
\n", + "
" + ], + "text/plain": [ + " SessionID StaffYN \\\n", + "0 FCB8C84AEDB5855CDDB2F29E38C8C8D1 N \n", + "1 D052BA917FD4489BD63014BE6568670E N \n", + "3 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "6 0D3354A8E8C07196F17340B2C641487E N \n", + "8 5EC71AEB4FDE600004405F91FD0F0379 N \n", + "\n", + " Referrer \\\n", + "0 www.nlm.nih.gov/ \n", + "1 www.nlm.nih.gov/ \n", + "3 www.nlm.nih.gov/bsd/serfile_addedinfo.html \n", + "6 www.nlm.nih.gov/nlmhome.html \n", + "8 www.nlm.nih.gov/bsd/pmresources.html \n", + "\n", + " Query Timestamp \\\n", + "0 lichen ruber mucosae 2018-07-30 07:48:01.000 \n", + "1 molecular identification of marine bacteria 2018-07-30 01:14:26.000 \n", + "3 secondaries brain prognostic factors 2018-07-30 02:18:34.999 \n", + "6 parasites 2018-07-30 02:43:59.999 \n", + "8 vojta 2018-07-30 06:19:02.000 \n", + "\n", + " adjustedQueryCase \n", + "0 lichen ruber mucosae \n", + "1 molecular identification of marine bacteria \n", + "3 secondaries brain prognostic factors \n", + "6 parasites \n", + "8 vojta " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 5. Clean up content to improve matching\n", + "# ========================================\n", + "'''\n", + "NOTE: Do not limit to a-zA-Z0-9 re: non-English character sets.\n", + "'''\n", + "\n", + "# FIXME - Remove punctuation. 
Also must include a fix for punct at start WITH trailing space\n",
+ "\n",
+ "# NOTE: pandas treats one-character patterns here literally; longer patterns are regexes.\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\"', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(\"'\", \"\")\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(\"`\", \"\")\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('(', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(')', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('.', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(',', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('!', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('#NAME\\?', '')  # Excel error artifact; the ? is escaped because multi-char patterns are regexes\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('*', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('$', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('+', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('?', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('#', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('%', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(':', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(';', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('{', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('}', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('|', '')\n",
+ "\n",
+ "# searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('-', '')\n",
+ "\n",
+ "# Backslashes ARE required here: these multi-character patterns go through the regex\n",
+ "# engine, so the metacharacters ^ [ ] must be escaped (escaping < and > is harmless).\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\^', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\[', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\]', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\<', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\>', '')\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('\\\\', '')\n",
+ "\n",
+ "\n",
+ "# First-character issues\n",
+ "# searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^[0-9]{4}\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^-\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^/\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^@\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^;\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^<\") == False] # char entities\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.contains(\"^>\") == False] # char entities\n",
+ "\n",
+ "# If removing punct left leading spaces, remove them (str.strip() below catches any stragglers).\n",
+ "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^ +', '')\n",
+ "\n",
+ "# Drop junk rows\n",
+ "searchLog = searchLog[searchLog.adjustedQueryCase.str.startswith(\"&#\") == False] # char entities\n",
+ "searchLog = 
searchLog[searchLog.adjustedQueryCase.str.contains(\"^&[0-9]{4}\") == False] # char entities\n", + "\n", + "# Remove modified entries that are now dupes or blank entries\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(' ', ' ') # two spaces to one\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.strip() # remove leading and trailing spaces\n", + "searchLog = searchLog.loc[(searchLog['adjustedQueryCase'] != \"\")]\n", + "\n", + "\n", + "# Test - Does the following do anything, good or bad? Can't tell. Remove non-ASCII; https://www.quora.com/How-do-I-remove-non-ASCII-characters-e-g-%C3%90%C2%B1%C2%A7%E2%80%A2-%C2%B5%C2%B4%E2%80%A1%C5%BD%C2%AE%C2%BA%C3%8F%C6%92%C2%B6%C2%B9-from-texts-in-Panda%E2%80%99s-DataFrame-columns\n", + "# I think a previous operation converted these already, for example, دوشن\n", + "# def remove_non_ascii(Query):\n", + "# return ''.join(i for i in Query if ord(i)<128)\n", + "# testingOnly = uniqueSearchTerms['Query'] = uniqueSearchTerms['Query'].apply(remove_non_ascii)\n", + "# Also https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space?rq=1\n", + "\n", + "# Remove starting text that can complicate matching\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^benefits of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^cause of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^cause for ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^causes for ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^causes of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^definition for ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^definition of ', '')\n", + 
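"# A hypothetical one-pass alternative (commented out; verify it covers the same phrases as the line-by-line replaces in this section before swapping it in):\n",
+ "# searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace(\n",
+ "#     r'^(benefits of|causes? (of|for)|definition (of|for)|effects? of|etiology of|symptoms of|treating|treatments? (for|of)|what (are|causes|is( a)?)) ', '')\n",
+ 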
"searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^effect of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^etiology of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^symptoms of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^treating ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^treatment for ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^treatments for ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^treatment of ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^what are ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^what causes ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^what is a ', '')\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^what is ', '')\n", + "\n", + "# Is this one different than the above? Such as, pathology of the lung\n", + "searchLog['adjustedQueryCase'] = searchLog['adjustedQueryCase'].str.replace('^pathology of ', '')\n", + "\n", + "searchLog.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Make special-case assignments with F&R, RegEx: Bibliographic, Numeric, Named entities\n", + "\n", + "Later procedures can't match the below very well. For rows we know won't be matchable later, assign preferredTerm here - PubMed search strategies, whatever. Future: clean these up, add nuance.\n", + "\n", + "- Remove errant punctuation; create dataframe of unique terms with frequency.\n", + "- Run list of RegEx operations based on solutions to historical problems. 
\n", + " During the rest of the project, update RegEx list to improve future matching.\n", + " Designators such as PMIDs, DOIs, ISSNs, ISBNs, etc.\n", + " Tag PubMed search strategies - i.e., find entries with \\[DT\\], other \n", + " PubMed field tags, in entries that are over ~20 characters, indicating \n", + " the entry is a PubMed search strategy.\n", + "- Remove known errant punctuation with RegEx\n", + "- Use the \"gold standard\" history file - previously matched terms that have \n", + "been assigned from UMLS, plus vetted terms that were matched from \n", + "dictionaries (named entities), etc., added manually, AND vetted. Solve in \n", + "the new file everything that was solved in the past.\n", + "\n", + "We have one page on the web site that is a HUGE OUTLIER, where ~30% of searches are run from, and we\n", + "know that people are running PubMed searches in this site search blank. \n", + "Pre-assign these to avoid blasting the API matching resources unnecessarily?\n", + "\n", + "FIXME - The below preferredTerm entries are added before several cols are\n", + "available. 
Later on you will need to assign SemanticTypeName, etc., so these \n",
+ "rows will be picked up in the status charts.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Assignments to preferredTerm\n",
+ "Bibliographic Entity 3901\n",
+ "pmresources.html 1502\n",
+ "Numeric Entity 417\n",
+ "Name: preferredTerm, dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Start new df for protection - rollbacks if needed\n",
+ "searchLogClean = searchLog.copy()  # .copy() makes a real copy; plain assignment is only an alias, so a rollback would be impossible\n",
+ "\n",
+ "# --- pmresources.html ---\n",
+ "searchLogClean.loc[searchLogClean['Referrer'].str.contains('/bsd/pmresources.html'), 'preferredTerm'] = 'pmresources.html'\n",
+ "# ToTestThis = searchLogClean[searchLogClean.Referrer.str.contains(\"/bsd/pmresources.html\") == True]\n",
+ "\n",
+ "\n",
+ "# --- Bibliographic Entity ---\n",
+ "# Assign ALL queries over 20 char to 'Bibliographic Entity' (often citations, search strategies, pub titles...)\n",
+ "searchLogClean.loc[(searchLogClean['adjustedQueryCase'].str.len() > 20), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "\n",
+ "# searchLogClean.loc[(searchLogClean['adjustedQueryCase'].str.len() > 25) & (~searchLogClean['preferredTerm'].str.contains('pmresources.html', na=False)), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "\n",
+ "# Search strategies might also be in the form \"clinical trial\" and \"phase 0\"\n",
+ "searchLogClean.loc[searchLogClean['adjustedQueryCase'].str.contains('[a-z]{3,}\" and \"[a-z]{3,}', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "\n",
+ "# Queries about specific journal titles\n",
+ "searchLogClean.loc[searchLogClean['adjustedQueryCase'].str.contains('^journal 
of', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "searchLogClean.loc[searchLogClean['adjustedQueryCase'].str.contains('^international journal of', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "\n",
+ "\n",
+ "# --- Numeric Entity ---\n",
+ "# Assign entries starting with 3 digits\n",
+ "# FIXME - Clarify and grab the below, PMID, ISSN, ISBN, etc.\n",
+ "searchLogClean.loc[searchLogClean['adjustedQueryCase'].str.contains('^[0-9]{3,}', na=False), 'preferredTerm'] = 'Numeric Entity'\n",
+ "searchLogClean.loc[searchLogClean['adjustedQueryCase'].str.contains('[0-9]{5,}', na=False), 'preferredTerm'] = 'Numeric Entity'\n",
+ "\n",
+ "# Note: the ^ anchor in '^[0-9]{3,}' already restricts that match to entries STARTING\n",
+ "# with 3 digits; the second pattern, '[0-9]{5,}', matches 5+ digits anywhere in the entry.\n",
+ "# After this, might want to let loose of dates, from 201? to 202? or similar\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "If trying to clean up later in the process\n",
+ "logAfterUmlsApi2.loc[logAfterUmlsApi2['adjustedQueryCase'].str.contains('^journal of', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "logAfterUmlsApi2.loc[logAfterUmlsApi2['adjustedQueryCase'].str.contains('^international journal of', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n",
+ "logAfterUmlsApi2.loc[logAfterUmlsApi2['adjustedQueryCase'].str.contains('^[0-9]{3,}', na=False), 'preferredTerm'] = 'Numeric Entity'\n",
+ "\n",
+ "Assign '^pmid [0-9]'\n",
+ "Assign '^pmc [0-9]'\n",
+ "More - ISSNs are probably ####-####, etc. 
Match syntax to different types\n",
+ "of numbers\n",
+ "\n",
+ "# How different numbers might manifest\n",
+ "# MeSH unique IDs d009369 (d and 6 digits)\n",
+ "\n",
+ "# PMIDs pmid 23193287, pmid23193287\n",
+ "# PMC IDs pmc5419604, pmc/articles/pmc3221073\n",
+ "\n",
+ "nlm uid 8207799\n",
+ "nm_001096633\n",
+ "np_0105443\n",
+ "nr_1039342\n",
+ "pmc3183535\n",
+ "pmid 7187238\n",
+ "x95160\n",
+ "wp_012745022\n",
+ "\n",
+ "# Info about PMID, PMCID, Manuscript ID, DOI: https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/#converter\n",
+ "'''\n",
+ "\n",
+ "print(\"Assignments to preferredTerm\\n{}\".format(searchLogClean['preferredTerm'].value_counts()))\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Total queries in searchLogClean: 11655\n",
+ "\n",
+ "Pre-processing assignments:\n",
+ "Bibliographic Entity 3901\n",
+ "pmresources.html 1502\n",
+ "Numeric Entity 417\n",
+ "Name: preferredTerm, dtype: int64\n",
+ "\n",
+ "Assigned: 5820\n",
+ "Unassigned: 5835\n",
+ "\n",
+ "Percent of queries to resolve: 50.0%\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Useful to write out the cleaned up version; if you do re-processing, you can skip a bunch of work.\n",
+ "writer = pd.ExcelWriter(localDir + 'searchLogClean.xlsx')\n",
+ "searchLogClean.to_excel(writer,'searchLogClean')\n",
+ "# df2.to_excel(writer,'Sheet2')\n",
+ "writer.save()\n",
+ "\n",
+ "\n",
+ "# -------------\n",
+ "# How we doin?\n",
+ "# -------------\n",
+ "\n",
+ "TotQueries = len(searchLogClean)\n",
+ "Assigned = searchLogClean['preferredTerm'].notnull().sum()\n",
+ "Unassigned = searchLogClean['preferredTerm'].isnull().sum()\n",
+ "PercentUnassigned = (Unassigned / TotQueries) * 100\n",
+ "\n",
+ "print(\"\\nTotal queries in searchLogClean: {}\".format(TotQueries))\n",
+ "print(\"\\nPre-processing assignments:\\n{}\".format(searchLogClean['preferredTerm'].value_counts()))\n",
+ 
"print(\"\\nAssigned: {}\".format(Assigned))\n", + "print(\"Unassigned: {}\".format(Unassigned))\n", + "print(\"\\nPercent of queries to resolve: {}%\".format(round(PercentUnassigned)))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"\"\"\n", + "To look further occasionally at problems that might remain, that might require more-manual intervention...\n", + "\n", + "print(searchLogClean['preferredTerm'].value_counts())\n", + "\n", + "unassignedNow = searchLogClean[pd.isnull(searchLogClean['preferredTerm'])]\n", + "\n", + "unassignedNowUnique = unassignedNow.groupby('adjustedQueryCase').size()\n", + "unassignedNowUnique = pd.DataFrame({'timesSearched':unassignedNowUnique})\n", + "unassignedNowUnique = unassignedNowUnique.reset_index()\n", + "unassignedNowUnique = unassignedNowUnique.sort_values(by='timesSearched', ascending=True)\n", + "\n", + "# Drop rows\n", + "unassignedNowUnique = unassignedNowUnique[unassignedNowUnique.adjustedQueryCase.str.startswith(\"&\") == False] # char entities\n", + "\n", + "# df.col1.str.contains('^[Cc]ountry')\n", + "\n", + "unassignedNowUnique2 = unassignedNowUnique[unassignedNowUnique.adjustedQueryCase.str.contains(\"^&[0-9]{4}\") == False] # char entities\n", + "\n", + "# PMIDs pmid 23193287, pmid23193287\n", + "searchLogClean.loc[searchLogClean['Query'].str.contains('pmid [0-9]{8}|pmid [0-9]{8}', na=False), 'preferredTerm'] = 'Numeric Entity-PubMed'\n", + "print(searchLogClean['preferredTerm'].value_counts())\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 7. Create logAfterGoldStandard - Match to the \"gold standard\" file of historical matches\n", + "\n", + "Maintain a list of UMLS Semantic Network terms that you've already matched\n", + "and edited for accuracy. 
Over time, applying your history of vetted matches \n", + "before going out to the UMLS API should lighten your overall workload.\n", + "\n", + "9/9/2018: GoldStandard should be replaced after changing API to 1:n capture.\n", + "\n", + "Abandoned method\n", + "logWithNamedEntities = pd.read_excel(localDir + 'logWithNamedEntities.xlsx')\n", + "GoldStandard = localDir + 'GoldStandard.xlsx'\n", + "\n", + "GoldStandard = pd.read_excel(localDir + 'GoldStandard.xlsx')\n", + "GoldStandard['adjustedQueryCase'] = GoldStandard['Query'].str.lower()\n", + "GoldStandard.rename(columns={'Query': 'origQuery'}, inplace=True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SessionIDStaffYNReferrerQueryTimestampadjustedQueryCasepreferredTerm_xAddressBranchPositionCustomTreeNumberEntrySourceResourceTypeSemanticGroupSemanticGroupCodeSemanticTypeNamecontentStewardpreferredTerm_y
0FCB8C84AEDB5855CDDB2F29E38C8C8D1Nwww.nlm.nih.gov/lichen ruber mucosae2018-07-30 07:48:01.000lichen ruber mucosaeNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1D052BA917FD4489BD63014BE6568670ENwww.nlm.nih.gov/molecular identification of marine bacteria2018-07-30 01:14:26.000molecular identification of marine bacteriaBibliographic EntityNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
247C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/serfile_addedinfo.htmlsecondaries brain prognostic factors2018-07-30 02:18:34.999secondaries brain prognostic factorsBibliographic EntityNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
30D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasitesNaNNaN4.01113.0NaNNaNLiving Beings9.0EukaryoteNaNParasites
40D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasitesNaNNaN4.01113.0NaNNaNLiving Beings9.0EukaryoteNaNParasites
\n", + "
" + ], + "text/plain": [ + " SessionID StaffYN \\\n", + "0 FCB8C84AEDB5855CDDB2F29E38C8C8D1 N \n", + "1 D052BA917FD4489BD63014BE6568670E N \n", + "2 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "3 0D3354A8E8C07196F17340B2C641487E N \n", + "4 0D3354A8E8C07196F17340B2C641487E N \n", + "\n", + " Referrer \\\n", + "0 www.nlm.nih.gov/ \n", + "1 www.nlm.nih.gov/ \n", + "2 www.nlm.nih.gov/bsd/serfile_addedinfo.html \n", + "3 www.nlm.nih.gov/nlmhome.html \n", + "4 www.nlm.nih.gov/nlmhome.html \n", + "\n", + " Query Timestamp \\\n", + "0 lichen ruber mucosae 2018-07-30 07:48:01.000 \n", + "1 molecular identification of marine bacteria 2018-07-30 01:14:26.000 \n", + "2 secondaries brain prognostic factors 2018-07-30 02:18:34.999 \n", + "3 parasites 2018-07-30 02:43:59.999 \n", + "4 parasites 2018-07-30 02:43:59.999 \n", + "\n", + " adjustedQueryCase preferredTerm_x Address \\\n", + "0 lichen ruber mucosae NaN NaN \n", + "1 molecular identification of marine bacteria Bibliographic Entity NaN \n", + "2 secondaries brain prognostic factors Bibliographic Entity NaN \n", + "3 parasites NaN NaN \n", + "4 parasites NaN NaN \n", + "\n", + " BranchPosition CustomTreeNumber EntrySource ResourceType SemanticGroup \\\n", + "0 NaN NaN NaN NaN NaN \n", + "1 NaN NaN NaN NaN NaN \n", + "2 NaN NaN NaN NaN NaN \n", + "3 4.0 1113.0 NaN NaN Living Beings \n", + "4 4.0 1113.0 NaN NaN Living Beings \n", + "\n", + " SemanticGroupCode SemanticTypeName contentSteward preferredTerm_y \n", + "0 NaN NaN NaN NaN \n", + "1 NaN NaN NaN NaN \n", + "2 NaN NaN NaN NaN \n", + "3 9.0 Eukaryote NaN Parasites \n", + "4 9.0 Eukaryote NaN Parasites " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# FIXME - see notes below, problem here\n", + "logAfterGoldStandard = pd.merge(searchLogClean, GoldStandard, how='left', on='adjustedQueryCase')\n", + "\n", + "logAfterGoldStandard.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + 
"metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['SessionID', 'StaffYN', 'Referrer', 'Query', 'Timestamp',\n", + " 'adjustedQueryCase', 'preferredTerm_x', 'Address', 'BranchPosition',\n", + " 'CustomTreeNumber', 'EntrySource', 'ResourceType', 'SemanticGroup',\n", + " 'SemanticGroupCode', 'SemanticTypeName', 'contentSteward',\n", + " 'preferredTerm_y'],\n", + " dtype='object')" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# logAfterGoldStandard.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. Temporary fix:\n", + "logAfterGoldStandard['preferredTerm2'] = logAfterGoldStandard['preferredTerm_x'].where(logAfterGoldStandard['preferredTerm_x'].notnull(), logAfterGoldStandard['preferredTerm_y'])\n", + "logAfterGoldStandard.drop(['preferredTerm_x', 'preferredTerm_y'], axis=1, inplace=True)\n", + "logAfterGoldStandard.rename(columns={'preferredTerm2': 'preferredTerm'}, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\"\\nIf trying to clean up later outside normal flow\\n\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticGroup'] = 'Concepts and Ideas'\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticTypeName'] = 'Intellectual Product'\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'BranchPosition'] = 3\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'CustomTreeNumber'] = 124\\n\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticGroup'] = 
'Concepts and Ideas'\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticTypeName'] = 'Intellectual Product'\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'BranchPosition'] = 3\\nlogAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'CustomTreeNumber'] = 124\\n\""
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Now we can add the missing columns after what we started above in Bibliographic Entity, Numeric Entity rows\n",
+ "\n",
+ "# FIXME - New, change as needed\n",
+ "# Order matters: assign SemanticGroup/SemanticTypeName BEFORE renaming; after the\n",
+ "# rename below, no row starts with 'Bibliographic Entity', so these would match nothing.\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.startswith('Bibliographic Entity', na=False), 'SemanticGroup'] = 'Unparsed'\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.startswith('Bibliographic Entity', na=False), 'SemanticTypeName'] = 'Unparsed'\n",
+ "\n",
+ "logAfterGoldStandard['preferredTerm'] = logAfterGoldStandard['preferredTerm'].str.replace('Bibliographic Entity', 'PubMed strategy, citation, unclear, etc.')\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "Old version\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticGroup'] = 'Concepts and Ideas'\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticTypeName'] = 'Intellectual Product'\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'BranchPosition'] = 3\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'CustomTreeNumber'] = 124\n",
+ "'''\n",
+ "\n",
+ "\n",
+ "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticGroup'] = 'Accession Number'\n",
+ 
"logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticTypeName'] = 'Accession Number'\n", + "# logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Numeric Entity', na=False), 'BranchPosition'] = 3\n", + "# logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('Numeric Entity', na=False), 'CustomTreeNumber'] = 124\n", + "\n", + "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('NON-ENGLISH CHARACTERS', na=False), 'SemanticGroup'] = 'Foreign language'\n", + "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('NON-ENGLISH CHARACTERS', na=False), 'SemanticTypeName'] = 'Foreign language'\n", + "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('NON-ENGLISH CHARACTERS', na=False), 'BranchPosition'] = 0\n", + "logAfterGoldStandard.loc[logAfterGoldStandard['preferredTerm'].str.contains('NON-ENGLISH CHARACTERS', na=False), 'CustomTreeNumber'] = 000\n", + "\n", + "\n", + "'''\n", + "If trying to clean up later outside normal flow\n", + "\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticGroup'] = 'Concepts and Ideas'\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'SemanticTypeName'] = 'Intellectual Product'\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'BranchPosition'] = 3\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Bibliographic Entity', na=False), 'CustomTreeNumber'] = 124\n", + "\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticGroup'] = 'Concepts and Ideas'\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticTypeName'] = 'Intellectual 
Product'\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'BranchPosition'] = 3\n", + "logAfterUmlsApi2.loc[logAfterUmlsApi2['preferredTerm'].str.contains('Numeric Entity', na=False), 'CustomTreeNumber'] = 124\n", + "'''\n", + "\n", + "# Leaving pmresources.html rows to be updated " + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SessionIDStaffYNReferrerQueryTimestampadjustedQueryCaseAddressBranchPositionCustomTreeNumberEntrySourceResourceTypeSemanticGroupSemanticGroupCodeSemanticTypeNamecontentStewardpreferredTerm
0FCB8C84AEDB5855CDDB2F29E38C8C8D1Nwww.nlm.nih.gov/lichen ruber mucosae2018-07-30 07:48:01.000lichen ruber mucosaeNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1D052BA917FD4489BD63014BE6568670ENwww.nlm.nih.gov/molecular identification of marine bacteria2018-07-30 01:14:26.000molecular identification of marine bacteriaNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
247C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/serfile_addedinfo.htmlsecondaries brain prognostic factors2018-07-30 02:18:34.999secondaries brain prognostic factorsNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
30D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasitesNaN4.01113.0NaNNaNLiving Beings9.0EukaryoteNaNParasites
40D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasitesNaN4.01113.0NaNNaNLiving Beings9.0EukaryoteNaNParasites
50D3354A8E8C07196F17340B2C641487ENwww.nlm.nih.gov/nlmhome.htmlparasites2018-07-30 02:43:59.999parasitesNaN4.01113.0NaNNaNLiving Beings9.0EukaryoteNaNParasites
65EC71AEB4FDE600004405F91FD0F0379Nwww.nlm.nih.gov/bsd/pmresources.htmlvojta2018-07-30 06:19:02.000vojtaNaNNaNNaNNaNNaNNaNNaNNaNNaNpmresources.html
7D3DC089B24442B3796F18BD8581DDAE3Nwww.nlm.nih.gov/kaatsu2018-07-30 08:05:02.999kaatsuNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
847C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/htc2018-07-30 23:22:47.999htcNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
95BA856161F7799CF43DB5F82DB0FFE5DNwww.nlm.nih.gov/bsd/pmresources.htmlnedd82018-07-30 04:34:01.999nedd8NaNNaNNaNNaNNaNNaNNaNNaNNaNpmresources.html
109DE8BE123BD8A5C620F7F6E360C932F2Nwww.nlm.nih.gov/jour guilan uni med sci2018-07-30 10:07:09.999jour guilan uni med sciNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
1161D85C39CF12C86918334F7DE47F0D5CNwww.nlm.nih.gov/and am willing to bet there are not many other...2018-07-30 16:46:01.999and am willing to bet there are not many other...NaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
123AAEAD5B07167F3E471BD2EC27A08AD2Nwww.nlm.nih.gov/james d massie2018-07-30 18:33:34.999james d massieNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1347C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/pmresources.htmlwang be2018-07-30 00:53:34.000wang beNaNNaNNaNNaNNaNNaNNaNNaNNaNpmresources.html
1447C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/gerd2018-07-30 02:34:10.000gerdNaN6.0222121.0NaNNaNDisorders6.0Disease or SyndromeNaNGastroesophageal reflux disease
1547C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/gerd2018-07-30 02:34:10.000gerdNaN6.0222121.0NaNNaNDisorders6.0Disease or SyndromeNaNGastroesophageal reflux disease
163E9E6BC66C5F4149470C99ACDFF9F4C3Nwww.nlm.nih.gov/glu2018-07-30 10:52:28.999gluNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1700B419C908ABCED089BF5D2D9A895F28Nwww.nlm.nih.gov/bsd/pmresources.htmladaptive design clinical trials2018-07-30 11:54:57.000adaptive design clinical trialsNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
1847C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/pmresources.htmlresearch questionnaire for prevention of child...2018-07-30 10:53:47.000research questionnaire for prevention of child...NaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
196A98D34D328585DDB56B3A2D0021EA92Nwww.nlm.nih.gov/bsd/disted/nurses/intro_quiz.htmltranslation ebp guidelines2018-07-30 18:36:22.000translation ebp guidelinesNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
2047C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/special_queries.htmlms and constipation2018-07-30 19:37:15.000ms and constipationNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
21452F3BCA978146BBD4F31DA81671E335Nwww.nlm.nih.gov/bsd/pmresources.htmldigest resistant2018-07-30 19:37:27.000digest resistantNaNNaNNaNNaNNaNNaNNaNNaNNaNpmresources.html
2247C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/pubmed.htmlcollie+eye+anomaly+sweden+rough2018-07-30 09:49:08.000collieeyeanomalyswedenroughNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
234C4D9C88AC39838AF94C0E10513294FCNwww.nlm.nih.gov/index.htmlinternational journal of naval architecture an...2018-07-30 11:44:21.000international journal of naval architecture an...NaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
2447C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/news/genetics_tenth_anniversar...chd82018-07-30 14:59:07.999chd8NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2547C9DEE89B48E22FB53E2BE2DB107763Nwww.nlm.nih.gov/bsd/pmresources.htmlafrican american men social determinants of he...2018-07-30 23:59:07.999african american men social determinants of he...NaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
26D5354A41531899B74AB388E8A0B269E8Nwww.nlm.nih.gov/bsd/medline.htmlvan gurp, maria2018-07-30 14:52:16.000van gurp mariaNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2704D4C719A11BEE4018E1A44B2C4A3D71Nwww.nlm.nih.gov/alignx2018-07-30 20:20:37.999alignxNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
287598EE17908E487DECBDD2320CA9AA84Nwww.nlm.nih.gov/mesh/autohemoterapiy2018-07-30 17:20:12.000autohemoterapiyNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
290F8C16185894B53CC363E8CC446F8827Nwww.nlm.nih.gov/mesh/meshhome.htmlliver transaminases and cholangitis2018-07-30 09:02:01.000liver transaminases and cholangitisNaNNaNNaNNaNNaNNaNNaNNaNNaNPubMed strategy, citation, unclear, etc.
\n", + "
" + ], + "text/plain": [ + " SessionID StaffYN \\\n", + "0 FCB8C84AEDB5855CDDB2F29E38C8C8D1 N \n", + "1 D052BA917FD4489BD63014BE6568670E N \n", + "2 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "3 0D3354A8E8C07196F17340B2C641487E N \n", + "4 0D3354A8E8C07196F17340B2C641487E N \n", + "5 0D3354A8E8C07196F17340B2C641487E N \n", + "6 5EC71AEB4FDE600004405F91FD0F0379 N \n", + "7 D3DC089B24442B3796F18BD8581DDAE3 N \n", + "8 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "9 5BA856161F7799CF43DB5F82DB0FFE5D N \n", + "10 9DE8BE123BD8A5C620F7F6E360C932F2 N \n", + "11 61D85C39CF12C86918334F7DE47F0D5C N \n", + "12 3AAEAD5B07167F3E471BD2EC27A08AD2 N \n", + "13 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "14 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "15 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "16 3E9E6BC66C5F4149470C99ACDFF9F4C3 N \n", + "17 00B419C908ABCED089BF5D2D9A895F28 N \n", + "18 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "19 6A98D34D328585DDB56B3A2D0021EA92 N \n", + "20 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "21 452F3BCA978146BBD4F31DA81671E335 N \n", + "22 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "23 4C4D9C88AC39838AF94C0E10513294FC N \n", + "24 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "25 47C9DEE89B48E22FB53E2BE2DB107763 N \n", + "26 D5354A41531899B74AB388E8A0B269E8 N \n", + "27 04D4C719A11BEE4018E1A44B2C4A3D71 N \n", + "28 7598EE17908E487DECBDD2320CA9AA84 N \n", + "29 0F8C16185894B53CC363E8CC446F8827 N \n", + "\n", + " Referrer \\\n", + "0 www.nlm.nih.gov/ \n", + "1 www.nlm.nih.gov/ \n", + "2 www.nlm.nih.gov/bsd/serfile_addedinfo.html \n", + "3 www.nlm.nih.gov/nlmhome.html \n", + "4 www.nlm.nih.gov/nlmhome.html \n", + "5 www.nlm.nih.gov/nlmhome.html \n", + "6 www.nlm.nih.gov/bsd/pmresources.html \n", + "7 www.nlm.nih.gov/ \n", + "8 www.nlm.nih.gov/ \n", + "9 www.nlm.nih.gov/bsd/pmresources.html \n", + "10 www.nlm.nih.gov/ \n", + "11 www.nlm.nih.gov/ \n", + "12 www.nlm.nih.gov/ \n", + "13 www.nlm.nih.gov/bsd/pmresources.html \n", + "14 www.nlm.nih.gov/ \n", + "15 
www.nlm.nih.gov/ \n", + "16 www.nlm.nih.gov/ \n", + "17 www.nlm.nih.gov/bsd/pmresources.html \n", + "18 www.nlm.nih.gov/bsd/pmresources.html \n", + "19 www.nlm.nih.gov/bsd/disted/nurses/intro_quiz.html \n", + "20 www.nlm.nih.gov/bsd/special_queries.html \n", + "21 www.nlm.nih.gov/bsd/pmresources.html \n", + "22 www.nlm.nih.gov/bsd/pubmed.html \n", + "23 www.nlm.nih.gov/index.html \n", + "24 www.nlm.nih.gov/news/genetics_tenth_anniversar... \n", + "25 www.nlm.nih.gov/bsd/pmresources.html \n", + "26 www.nlm.nih.gov/bsd/medline.html \n", + "27 www.nlm.nih.gov/ \n", + "28 www.nlm.nih.gov/mesh/ \n", + "29 www.nlm.nih.gov/mesh/meshhome.html \n", + "\n", + " Query Timestamp \\\n", + "0 lichen ruber mucosae 2018-07-30 07:48:01.000 \n", + "1 molecular identification of marine bacteria 2018-07-30 01:14:26.000 \n", + "2 secondaries brain prognostic factors 2018-07-30 02:18:34.999 \n", + "3 parasites 2018-07-30 02:43:59.999 \n", + "4 parasites 2018-07-30 02:43:59.999 \n", + "5 parasites 2018-07-30 02:43:59.999 \n", + "6 vojta 2018-07-30 06:19:02.000 \n", + "7 kaatsu 2018-07-30 08:05:02.999 \n", + "8 htc 2018-07-30 23:22:47.999 \n", + "9 nedd8 2018-07-30 04:34:01.999 \n", + "10 jour guilan uni med sci 2018-07-30 10:07:09.999 \n", + "11 and am willing to bet there are not many other... 2018-07-30 16:46:01.999 \n", + "12 james d massie 2018-07-30 18:33:34.999 \n", + "13 wang be 2018-07-30 00:53:34.000 \n", + "14 gerd 2018-07-30 02:34:10.000 \n", + "15 gerd 2018-07-30 02:34:10.000 \n", + "16 glu 2018-07-30 10:52:28.999 \n", + "17 adaptive design clinical trials 2018-07-30 11:54:57.000 \n", + "18 research questionnaire for prevention of child... 2018-07-30 10:53:47.000 \n", + "19 translation ebp guidelines 2018-07-30 18:36:22.000 \n", + "20 ms and constipation 2018-07-30 19:37:15.000 \n", + "21 digest resistant 2018-07-30 19:37:27.000 \n", + "22 collie+eye+anomaly+sweden+rough 2018-07-30 09:49:08.000 \n", + "23 international journal of naval architecture an... 
2018-07-30 11:44:21.000 \n", + "24 chd8 2018-07-30 14:59:07.999 \n", + "25 african american men social determinants of he... 2018-07-30 23:59:07.999 \n", + "26 van gurp, maria 2018-07-30 14:52:16.000 \n", + "27 alignx 2018-07-30 20:20:37.999 \n", + "28 autohemoterapiy 2018-07-30 17:20:12.000 \n", + "29 liver transaminases and cholangitis 2018-07-30 09:02:01.000 \n", + "\n", + " adjustedQueryCase Address BranchPosition \\\n", + "0 lichen ruber mucosae NaN NaN \n", + "1 molecular identification of marine bacteria NaN NaN \n", + "2 secondaries brain prognostic factors NaN NaN \n", + "3 parasites NaN 4.0 \n", + "4 parasites NaN 4.0 \n", + "5 parasites NaN 4.0 \n", + "6 vojta NaN NaN \n", + "7 kaatsu NaN NaN \n", + "8 htc NaN NaN \n", + "9 nedd8 NaN NaN \n", + "10 jour guilan uni med sci NaN NaN \n", + "11 and am willing to bet there are not many other... NaN NaN \n", + "12 james d massie NaN NaN \n", + "13 wang be NaN NaN \n", + "14 gerd NaN 6.0 \n", + "15 gerd NaN 6.0 \n", + "16 glu NaN NaN \n", + "17 adaptive design clinical trials NaN NaN \n", + "18 research questionnaire for prevention of child... NaN NaN \n", + "19 translation ebp guidelines NaN NaN \n", + "20 ms and constipation NaN NaN \n", + "21 digest resistant NaN NaN \n", + "22 collieeyeanomalyswedenrough NaN NaN \n", + "23 international journal of naval architecture an... NaN NaN \n", + "24 chd8 NaN NaN \n", + "25 african american men social determinants of he... 
NaN NaN \n", + "26 van gurp maria NaN NaN \n", + "27 alignx NaN NaN \n", + "28 autohemoterapiy NaN NaN \n", + "29 liver transaminases and cholangitis NaN NaN \n", + "\n", + " CustomTreeNumber EntrySource ResourceType SemanticGroup \\\n", + "0 NaN NaN NaN NaN \n", + "1 NaN NaN NaN NaN \n", + "2 NaN NaN NaN NaN \n", + "3 1113.0 NaN NaN Living Beings \n", + "4 1113.0 NaN NaN Living Beings \n", + "5 1113.0 NaN NaN Living Beings \n", + "6 NaN NaN NaN NaN \n", + "7 NaN NaN NaN NaN \n", + "8 NaN NaN NaN NaN \n", + "9 NaN NaN NaN NaN \n", + "10 NaN NaN NaN NaN \n", + "11 NaN NaN NaN NaN \n", + "12 NaN NaN NaN NaN \n", + "13 NaN NaN NaN NaN \n", + "14 222121.0 NaN NaN Disorders \n", + "15 222121.0 NaN NaN Disorders \n", + "16 NaN NaN NaN NaN \n", + "17 NaN NaN NaN NaN \n", + "18 NaN NaN NaN NaN \n", + "19 NaN NaN NaN NaN \n", + "20 NaN NaN NaN NaN \n", + "21 NaN NaN NaN NaN \n", + "22 NaN NaN NaN NaN \n", + "23 NaN NaN NaN NaN \n", + "24 NaN NaN NaN NaN \n", + "25 NaN NaN NaN NaN \n", + "26 NaN NaN NaN NaN \n", + "27 NaN NaN NaN NaN \n", + "28 NaN NaN NaN NaN \n", + "29 NaN NaN NaN NaN \n", + "\n", + " SemanticGroupCode SemanticTypeName contentSteward \\\n", + "0 NaN NaN NaN \n", + "1 NaN NaN NaN \n", + "2 NaN NaN NaN \n", + "3 9.0 Eukaryote NaN \n", + "4 9.0 Eukaryote NaN \n", + "5 9.0 Eukaryote NaN \n", + "6 NaN NaN NaN \n", + "7 NaN NaN NaN \n", + "8 NaN NaN NaN \n", + "9 NaN NaN NaN \n", + "10 NaN NaN NaN \n", + "11 NaN NaN NaN \n", + "12 NaN NaN NaN \n", + "13 NaN NaN NaN \n", + "14 6.0 Disease or Syndrome NaN \n", + "15 6.0 Disease or Syndrome NaN \n", + "16 NaN NaN NaN \n", + "17 NaN NaN NaN \n", + "18 NaN NaN NaN \n", + "19 NaN NaN NaN \n", + "20 NaN NaN NaN \n", + "21 NaN NaN NaN \n", + "22 NaN NaN NaN \n", + "23 NaN NaN NaN \n", + "24 NaN NaN NaN \n", + "25 NaN NaN NaN \n", + "26 NaN NaN NaN \n", + "27 NaN NaN NaN \n", + "28 NaN NaN NaN \n", + "29 NaN NaN NaN \n", + "\n", + " preferredTerm \n", + "0 NaN \n", + "1 PubMed strategy, citation, unclear, etc. 
\n", + "2 PubMed strategy, citation, unclear, etc. \n", + "3 Parasites \n", + "4 Parasites \n", + "5 Parasites \n", + "6 pmresources.html \n", + "7 NaN \n", + "8 NaN \n", + "9 pmresources.html \n", + "10 PubMed strategy, citation, unclear, etc. \n", + "11 PubMed strategy, citation, unclear, etc. \n", + "12 NaN \n", + "13 pmresources.html \n", + "14 Gastroesophageal reflux disease \n", + "15 Gastroesophageal reflux disease \n", + "16 NaN \n", + "17 PubMed strategy, citation, unclear, etc. \n", + "18 PubMed strategy, citation, unclear, etc. \n", + "19 PubMed strategy, citation, unclear, etc. \n", + "20 NaN \n", + "21 pmresources.html \n", + "22 PubMed strategy, citation, unclear, etc. \n", + "23 PubMed strategy, citation, unclear, etc. \n", + "24 NaN \n", + "25 PubMed strategy, citation, unclear, etc. \n", + "26 NaN \n", + "27 NaN \n", + "28 NaN \n", + "29 PubMed strategy, citation, unclear, etc. " + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "logAfterGoldStandard.head(30)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# Save to file so you can open in future sessions, if needed\n", + "writer = pd.ExcelWriter(localDir + 'logAfterGoldStandard.xlsx')\n", + "logAfterGoldStandard.to_excel(writer,'logAfterGoldStandard')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# Decision point, whether to add semantic group (15 supercategories) data, to show process bars after GoldStandard??\n", + "# Or temp join\n", + "# Or move semantic assignments later to reduce processing load and here \n", + "# only show percent of preferredTerm assignments.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWQAAAD7CAYAAABdXO4CAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJztnXd4W+XZ/z+3vEdsk0WcRSAkTkwCSYBAgWBmS8AtUGgpq2UWOn4tvKUUuhxBR2j7lhYoLS9QRlo2tICLgRISs0kAhxBMnL294iFvW7ae3x/nOAhjS7Yj6cjS/bkuX5b0nPE9R0dfPbqf59y3GGNQFEVRnMfltABFURTFQg1ZURQlSlBDVhRFiRLUkBVFUaIENWRFUZQoQQ1ZURQlSlBDHuGIyPEislFEWkTkHKf1DAYRMSJy6ABtl4nIG5HW1B+h1iIiJ4nIrlBtLxoRkUUiUuG0jpFK3BqyiJwgIm+JiEdE6kXkTRE52m4b0gdRRKbZJpMYPsUDcgtwlzEm0xjzbxFZKSJXhXIH9jZP8ns+Q0QeE5FaEWmyvxDuFJHJodyvva+zRWSNvZ+9IrJcRKbZbUtE5B+h3mekEJFtvccSKxhjXjfG5DmtY6QSl4YsIllAMXAnMBqYBLiBTid1DZODgI9DtTERSQjSfijwLrAHmG+MyQKOBzYDJ4RKh9++HgZ+BGQDBwN3A75Q7iccRPLL2aGOgBIOjDFx9wccBTQO0DYb6AB6gJbe5YCzgDKgCdgJLPFbZwdg7OVbgC8AS4B/+C0zzV4m0X5+GbAFaAa2AhcPoGch8DbQCFQCdwHJdttmLHNqt/f7W1t3h/38Lnu5WcB/gXqgAvi63/YfBP4KvAC0Aqf1o2ElcJL9+B/A84M4x1cDm+x9PgdM9GszwKH24zF2exOwCrgVeMNuOx9YM8D2zwC6AK99rB/ar18OfGKf1y3ANX7rnATswjL4Gvt8Xu7XPqAWu/3P9nvfBLwPLPJrWwI8ZZ+fJuAqIM0+vw1AOfBjYJffOtuAaQMc30r7/VwFeIBngdF9rqUrsa691+zXv4L15dxorz/bb3tTgGeAWqCu99qw266wz1kD8BJwkP26ALfb58oDrAXm2G1n2sfUDOwGbvA/x32O8QZ7XQ/wOJDq136j/T7ssc/ZvmsjHv8cF+DIQUOWfVE+BCwGDujTfpn/B9F+7SRgLtavisOBauAcu633A5Lot/wSBjBkIMP+0ObZbbnAYQNoPRI41l5vmv3Buc6vfRt+Jmp/EK/ye56BZSKX29tYAOzt3R+WYXiwerku/w/LAHqqgMuCLHOKvY8FQArWL5HX/Nr9Dfkx4Alb5xz7w91ryIdgfbncDpwMZPbZz2fOsf3aWcB0LDMpANqABX7vYTdWmCcJy1Taet//QFrs9kuwTDsRy9Sres+XrcULnGOfxzRgKfA61q+wKcA6/MwqyDlcae9/jq3n6d5j9buWHrbb0oCZWF+op9vHdiPWF2IykAB8aJ/HDCAVOMHe1jn2crPt4/o58Jbd9iWsL54c+3zOBnLttkrsLyTggD7nuK8hrwIm2ufhE+Bau+0M+xweBqQDy1BDdl6EIwduXVwPYvWYurF6RgfabZfRx5D7Wf9PwO32494PyFAMuRE4D0gbou7rgH/5Pd9GYEO+AHi9zzbuAYrsxw8CDw9h/93AGX7Pv28fSwtwr/3a/cDv/JbJxDKrafZzAxxqG4UXmOW37G/4rAkei2WStVjm/CC2Mfc9xwPo/TfwQ/vxSVi/Jvzfpxp7H0G19LPtBuAIPy2v9Wnf0udcfZuhGfJSv+f5WL8IEvyupUP82n8BPOH33IVl6Cdh/WKr9T9uv+VKgCv7rNeGFQo7Bdhgnx9Xn/V2ANcAWX1eP4nPG/Ilfs9/B/zNfvx34Ld+bYcS54YclzFkAGPMJ8aYy4wxk7F6IROxTLZfROQYEVlhD2R5gGuBscPcdyuWUV4LVIrIf0Rk1gD
7nSkixSJSJSJNWCYxlP0eBBwjIo29f8DFwAS/ZXYOYXt1WD363mO5yxiTg3XukuyXJwLb/ZZpsdeb1Gdb47C+oPz3v91/AWPMO8aYrxtjxgGLgBOBnw0kTkQWi8g79kBtI1Yv2P981Rljuv2et2F9YQTVIiI/EpFP7IHgRqy4tv+2+57HiYG2Nwj6rpsUYH99z7nPbp+E1Tvf3ue4ezkI+LPftVGP1RueZIx5FStE9hegWkT+zx5/AaszcSawXURKReQLAY6jyu9x7/nu1ex/DEO5DmOSuDVkf4wx67F6XnN6X+pnsUewetFTjDHZwN+wLtyBlm/F+hnWi78BYox5yRhzOpa5rQfuHUDeX+32GcYaQPup3377PZw+z3cCpcaYHL+/TGPMdwKsE4jlwFeDLLMH64MOgIhkYP3U391nuVqsHvcUv9emDrRRY8xqrDhov++TiKRg/bT/A9avnRys2Hig8zUoLSKyCPgJ8HWsEEcOVqjHf9t9z2PlQNsbJH3X9WKFgvrbX99zLvb6u7GugakDDP7txIqz+18facaYtwCMMXcYY47ECivMxIqDY4xZbYw5GxiP9SvkiSEeG1jnx39mzpSBFowX4tKQRWSW3duZbD+fAlwIvGMvUg1MFpFkv9VGAfXGmA4RWQhc5NdWizW4dojfa2uAE0VkqohkAzf77f9AEfmKbVSdWD/3ewaQOwor3txi96K/M8ByvVT30VEMzBSRS0Ukyf47WkRmB9nOQCwBFonIH0Vkkn08Y7FCQL08AlwuIvNsk/wN8K4xZpv/howxPVgGu0RE0kUkH/hWb7s9NfFqERlvP5+FNXDl/z5NE5He6zgZK2ZdC3SLyGLgi4M5qGBasN6HbnvbiSLyS6yxiEA8AdwsIgfY19r/G4wWPy4RkXwRSceKez9l6xxoX2eJyKkikoQV4+4E3sKK4VYCS0UkQ0RSReR4e72/2RoPAxCRbBH5mv34aPuXYRJWB6MD6BGRZBG5WESyjTFerOtzIF2BeALrOpltH+Mvh7GNmCIuDRlrZPgY4F0RacX6gK/DuogBXsUara4Skd4eyXeBW0SkGevC2dcjMMa0Ab8G3rR/+h1rjPkv1ojyWqyBkWK//bvsfe3B+olYYG+/P27AMv9mrF7040GO7c/A+SLSICJ3GGOasUzpG/b+qoDbsIxryBhjemOKk4EP7fPxpr3tX9jLLLcfP41lBNPt/ffH97F+wlZh/Up5wK+tEcuAPxKRFuBF4F9YcUiAJ+3/dSLygX2sP8B6bxqwzttzQzi8QFpewoq3bsAKDXQQ/Ce22152K/Ay1qDVUFhm66jCGoj7wUALGmMqsAYd78TqRX8Z+LIxpss28S9jxWh3YI2bXGCv9y+s6+ExOyS2DmugG6wvnHuxzuV2rLDTH+y2S4Ft9jrX2vseEsaYEuAOYAXWwOLbdtNInH4aEsQOpiuKEkWIyEqsAcv7nNYSKexfbeuAlAHi3TFPvPaQFUWJAkTkXDsEcgBWT/35eDVjUENWFMVZrsGKy2/GikMHGyOJaTRkoSiKEiVoD1lRFCVKUENWFEWJEtSQFUVRogQ1ZEVRlChBDVlRFCVKUENWFEWJEtSQFUVRogQ1ZEVRlChBDVlRFCVKUENWFEWJEtSQFUVRogQ1ZEVRlChBDVlRFCVKUENWFEWJEvoreqgoIwKP291bPHUMVjXmVKxCoF6gy/7r+7gNqMkuKtK8s0rUofmQlajE43ZPwCqcOsv+m4BlumP8/qcOc/NdWHXltmPVmPP/vx3YkV1UFLd13RTnUENWog6P270DZ0vCG+ATrOK372AV3yzPLiryOahJiQPUkJWIsbTMm3rT/KSO/toW5+WlAV8ADr7jK19ZMj4zc3Jk1QWlCVjFpyb9TnZRUZ2zkpRYQ2PISlhYWuadDBwHLADm238vAt8cYJWZWPXV6qpbWuqj0JCzgNPsPwDjcbtXA08DT2cXFW12TJkSM6ghKyFhaZk3BTgROMMYc4a
I5Pez2NwAm6gBWoCa6ubm3XMnTDg8HDpDiAAL7b/bPG73h3xqzuWOKlNGLBqyUIbN0jLvTOAMLBMuEJH0IKt0Apk3zU/6XJn3xXl5ScA9wM5TDz30oKsXLvxW6BVHjPXAM8BT2UVFZU6LUUYOasjKkFha5p0DXGGMOUdEDh7GJvJvmp/0SX8Ni/PyfgWkjM/I8N1x9tk37pfQ6GEV8GfgyeyiIq/TYpToRkMWSlCWlnkzgQt9Pd3XuhISFwCIyHA3NxdrBkN/bAWOrGltrWnr6mpOT04eNdydRBELgX8Cv/e43XcD92QXFe11WJMSpeidesqALC3zHveb9zv+bny+auD/es14PwkUR94MpAHUt7fXhGBf0cRE4FfATo/bfZ/H7Z7jtCAl+tAesvIZlpZ5c4wxVxhfzzWuhMSZLldCqHcRaLCuGugBqGlpqZ6cnT091DuPAlKBK4ErPW73q8DvsouKXnJYkxIlqCErACwt82Z3e7t+7EpI+KHLlZApCWG7NAL1kKuxf7Xt8nhqFkyaFC4N0cIpwCket/u/wA3ZRUVrnRakOIuGLOKcpWXezFvebLi1p9u7KzEp+WcuV0JmmHc5zY5J90cD1m3NiRv37q0Os45o4nSgzON23+9xu3OdFqM4hxpynLK0zJtxy5sNRT1e7+7k9MyfJyQmhduIexGg3/hpSUWFwcopkfFRVVWtL76mALmAK4CNHrd7iZ04SYkz1JDjjKVl3rRb3mq8ucfbtTs5PXNJQlJSlgMygg3sZXR0d/c0dXTE463JGUARsMHjdl/hcbv1MxpH6JsdRxS9Vntxd1fnjuS0jN8kJCVnOyglkCFvxx7bqGtri6ewRV8mAvcD73vc7nlOi1EigxpyHHDzi1vzfrGialXaqJx/JCanjHVaD8EH9gxAVXNzrE19Gw7zgFUet7vI43brIHyMo4Ycw/z4ufWJN7+49fbMMRPWZeSMOdppPX4Ey2khADsaG+O5h+xPErAEy5gDnTtlhKOGHKNc/9SHx6dnj96cfeDk6xISE6OtZzVmaZl3Yn8NJRUVrYAHSFlfW6s95M8yH1jtcbuv97jdw75VUole1JBjjGsfKE368fMV946blvdaWtYBU53WE4BAPb1tQMaG2toGb09PV4T0jBRSgD8CL3jc7gOdFqOEFjXkGOLq//vvvPEH520eM/mQq1wJCdH+3gYy5E1AhgEa29trI6RnpHEGsNbjdp/utBAldET7h1YZBPkFhXLl3S9cN/XwY95Jzx7jZOmjoRDIkHdjx5FrW1s1jjww44ESj9v9XaeFKKFBDXmEk19QmHrylTc/degxp/4xKSUtxWk9QyDYwJ4PYE9TkxpyYBKAv3jc7j/pnOWRj76BI5iv3Hj7wWf96A8fTJm78Kvico20QZ7ZS8u8A2UuqsW6NmVrQ4MO7A2OHwLPetzuSN1xqYQBNeQRysW/e7Rw3pkXfTBmyvTZTmsZJqlYdfQ+R0lFhReoBNLXVVVpD3nwFAKve9zumM/KFKuoIY8w8gsKXVf+teRXswu+/Ex69ugcp/XsJ4HCFluAjOqWlvZ2r7clUoJigN4bSUKRu1qJMGrII4j8gsKMEy69/okZx572s8TklCSn9YSAYIacCnF/C/VwmAi85nG7v+y0EGVoqCGPEPILCnOPv+gHLxxy5InnOa0lhAQy5CrsW6hrWlo0jjx0MoCnPW73V5wWogweNeQRQH5B4ZRFl17/3PSFJ5/otJYQEyynhQDs8ni0hzw8koAnPG73F50WogwONeQoJ7+gcFrB5Tc+f/CRJx7ltJYwcLAmqw87KcC/PW73SU4LUYKjhhzFzDnlnENPvuqnxQcd8YUjnNYSJgQ4rL8G/2T1H1dX742zZPWhJg143uN2H+e0ECUwashRyuGnnz/7lKt/9p8pc47u17BiiEBFTzcDGW1eb3dzZ2c8JqsPJZlY+S+OdFqIMjBqyFHI4aefN/eUq3/6/MRZ8/qdpxtjBEtWnwRQ19q
qA3v7TzbwsqbwjF7UkKOM/ILCecdd+P1HJ8yYO91pLREi2C3UPQBVLS0aRw4No4FXPG53vFxfIwo15Cgiv6Bw7rwzL7p72vwTYj1M4U+wmRYJADv0FupQMh5roE8LqUYZashRQn5B4dTpC0+5bc5p5x3jtJYIM2ZpmTe3vwY7WX0jkPJJba32kEPLHKyafUoUoYYcBeQXFB4wYebhtx5z/rdPdblc8fieaLJ6Z7jA43b/yGkRyqfE44c/qsgvKEzNGj/p5hO/+T/nJianJDutxyE0Wb1z3OZxu09xWoRioYbsIPkFhQnJ6ZnfPvmqmy9Lzcwe5bQeBwk09W1P7wNNVh8WEoDHPW53NJf7ihvUkB0iv6BQgHNPuuIn12ePnzTOaT0OE2xgD4A9zc06sBcexmLlvUh1Wki8o4bsHCcc+ZVv3Tjh0DnTnBYSBQRLVi+AbKuv1x5y+DgKuNtpEfGOGrID5BcUzpwwY+4Nswu+rDlrLVKBGf01fCZZfXW19pDDy+Uet/tcp0XEM2rIESa/oDArKTX9/51wyXXHuxISBuoVxiOBwhZbgYyq5uY2TVYfdv7icbuznRYRr6ghR5D8gkIX8M0TLrnu1PTs0WOc1hNlBDLkzWiy+kiRC/zeaRHxihpyZFk0/eiTz54y5+iRWgcvnGiy+ujhKk3X6QxqyBEiv6AwNzUz+4qjzr1iodNaopRAU9/2mbAmq48IAtzrcbvTnBYSb6ghR4D8gsJE4MoTLr1uYUp6ZpbTeqKUYMnqu4GETXV12kOODIcCS5wWEW+oIUeGU6cvPOXEiXnzZjktJIoJlKzeh52sfl1VVa0mq48YP9Lq1ZFFDTnM5BcUTnIlJF6woPBSvbCDE+wW6kw7WX19pATFOQnA/R63W2cDRQg15DBiz6q4fP5ZF09Py8rRWRXBCWTIO4BEgDq9hTqSzAMudlpEvKCGHF7mp2Rm5c88/oxYLFAaDoLdQu0DTVbvAEUetzvRaRHxgBpymMgvKEwBLj7m/GumJ6WkpjutZ4Sgyeqjk0OAK5wWEQ+oIYePgpyJB02ZOnehxo4Hz9ggyeo9QMp6TVbvBD/3uN0pTouIddSQw0B+QWEWcN6xX7t2tishUX/qDY2gyeorrGT13gjpUSymANc4LSLWUUMOD2dNzj9qwrhpefFUGy9UBLuFOt0AjR0dGraIPDd73G4Nv4URNeQQk19QmAucftS5l88TEafljEQCGfIurPnK7G1tVUOOPBOA7zstIpZRQw4hdtL58w9esGhs1riJWoFheAQy5BrsnBa7m5o0juwMN3rc7niubhNW1JBDywzgqPxTzuk3t68yKPIDJKuvwbpmNVm9c4xBZ1yEDTXk0HLW6CnTXaMnTctzWsgIJhUrj8LnsJPVV6HJ6p3maqcFxCpqyCEiv6BwAnD4vMUXThdxafB4/wiU+W0LmqzeaQ7zuN3HOy0iFlFDDh0npaSPYsKMuTrveP8JNtMiDaC+rU17yc6hveQwoIYcAvILCjOAU44488LcxKRknTy//wS7Y68HoEZvoXaSr3vc7hynRcQaasih4RiQxGnzjtOcFaEhmCELwK6mJu0hO0cacInTImINNeT9JL+gMAEonLVocVZqZrZmdAsNhywt82YM0LYvWf3GvXu1h+wsGrYIMWrI+89hwAEzj//SEU4LiSE0Wf3I4HCP232M0yJiCTXk/cC+EeSsjAPGdWePn3yI03pijEBhiy1Ahiarjwq0lxxC1JD3j4nAzNknnjVeXC49l6El0NS3bUASQF1bm4YtnOVsj9ut136I0BO5fxwOmNy8ebOdFhKDBBvYMwBVzc06sOcsY4GjnRYRK6ghDxM7XLEodVROS/aBGq4IA8FyWgjAjsZG7SE7z5lOC4gV1JCHz4FA7qxFZ050JSRoEcjQM3ZpmXdCfw0lFRUtQBOQUlFbqz1k51FDDhFqyMPnMMBMmr0g32khMUygXvJ2IGN9TU29Jqt3nCM9bvd4p0XEAmr
Iw2dRUmp6S07u1H4T4SghIZAhb0KT1UcLApzhtIhYQA15GOQXFI4FDpq16MzxCYlJSU7riWECGfJuNFl9NKFhixCghjw88gEm5R85y2khMU6gqW/7Zlrs0WT10cAXPW73kMZSRGSaiKzr89oSEbkhtNI+t99bROS0MO/jMhG5a6jrqSEPjxOAppwJUw52WkiMM6hk9Vvr67WH7DwHAMc6LWIwGGN+aYx5xWkd/aGGPETyCwpHAYeOmXqoLzktI8tpPTFOsGT11UDax9XV2kOODkKWI1lEVorIbSKySkQ2iMgi+/VpIvK6iHxg/x1nv54rIq+JyBoRWScii0QkQUQetJ9/JCLX28s+KCLn24/PFJH1IvKGiNwhIsX260tE5O+2ji0i8gM/bZfYutaIyD0ikmC/frmttXS450INeegcBJgpc47RmnmRIVhu5MxKK1l9a6QEKQNyZIi3l2iMWQhcBxTZr9UApxtjFgAXAHfYr18EvGSMmQccAawB5gGTjDFzjDFzgQf8Ny4iqcA9wGJjzAnAuD77nwV8CVgIFIlIkojMtvd7vL2vHuBiEckF3FhGfDp2WHOoqCEPnUMBM27aTDXkyBAsp0UqQL3eQh0NDNWQB0oM1fv6M/b/94Fp9uMk4F4R+Qh4kk+NbzVwuYgsAeYaY5qxro9DROROETkDa+66P7OALcaYrfbzR/u0/8cY02mM2Yv1RXAgcCrWca4WkTX280OAY4CVxphaY0wX8PhgTkBf1JCHzlygKWv8pMlOC4kTAhlyFeADTVYfJUwfYtL6OqzYsz+jgb324077fw+QaD++HitUdQRwFJAMYIx5DTgRa/bNMhH5pjGmwV5uJfA94L4++wpWaq3T73GvBgEeMsbMs//yjDFL7GX2O/OgGvIQyC8oTAGmJaaktadl5ehE+MgwqFuoNVl91DDoEmbGmBagUkROBRCR0Vjzmd8IsFo2UGmM8QGXAr3x24OAGmPMvcD9wAIRGQu4jDFPA7/oR9t6rB70NPv5BYOQvRw4X0TG92q29/0ucJKIjBGRJOBrg9jW51BDHhqTACbnHznO5UrQcxcZAiWrr8dOVr9Jk9VHC/3msQ7AN4Gf2z//XwXcxpjNAZa/G/iWiLwDzAR6xw5OAtaISBlwHvBnrM/rSnvbDwI3+2/IGNMOfBd4UUTewOp5ewKJNcaUAz8HXhaRtcB/gVxjTCWwBHgbeAX4YDAH35fE4IsofuQCrvGHzD7QaSFxhAvrQ76qb0NJRYVvcV7eTmD0uurqWmOMERGt+O0sQzJk2+BO7uf1k/we78WOIRtjNvLZ+ek3268/BDzUzy4+12M3xlzm93SFMWaWfd38BXjPXmZJn3Xm+D1+nH5ixMaYB+gzcDhU1JCHxkygIyd3aq7TQuKMufRjyDabgSmtXV1NzZ2d9VmpqSO+jNbc229nVEoKLhESXS5WXnMNDW1tXP7UU+xobGRqTg4Pfu1r5KSl8Wx5Ob9dsYID0tL45ze+wej0dLbW13Pr8uX8/WvD+tW8vwy1h+w0V4vIt7Bi0WVYsy4cQw15aMwEWjJyxox1WkicESiOvI1Pk9XXxIIhAzz/rW8xJuPTSM3tb7xBwcEHc/2iRdz++uvc/sYbuE8/nb+89Rb/veoqnlm3jic/+ohrjjmGX736Kj875RSnpI+oZFvGmNuB253W0YvGQQdJfkFhKta0l7bk9Ey9ISSyBBvY8wFUNjfHbBz5hYoKLpw3D4AL583jP+vXA+ASoau7mzavlySXi7e2b+fAzEymj3Hse2m0x+3uO3NCGSRqyIPnAMAnLhdJqelqyJElWPUQF8ROsnoR4dxlyyi45x4efO89AGpaWpgwahQAE0aNorbVGsv6yUkn8dV//IOVW7Zw3ty5/OG117ixoMAx7TZ9b7BQBomGLAZPNkDOhKmZOsMi4oxbWuY98Kb5Sf0ZbivQAiTHSrL6l664gtysLGpbWjhn2TJmjB04Qnby9OmcPH06AI+sWcP
pM2awce9e7nzrLXLS0lh6xhmkJydHSnov44ENkd5pLKDGMniyAFfOhCnaO3aGfjO/lVRUGGArkFlRW9vQ7fON+GT1uVnWJTYuM5PCWbP4YPduxmdmUtXcDEBVczPjMj47E7Ctq4tH16zhqqOPxr18OXedfTbzcnN58qOPIq4f7SEPGzXkwTMaYNS43GynhcQpwZLVZ/iMMY3t7bWREhQOWru6aO7s3Pd4xebNzB4/nsV5eTy6Zg0Aj65Zw5l5eZ9Z789vvsm1xx5LUkICHV4vIoJLhDavI99PasjDREMWg2cC0JlxwLhJTguJU4IlqwegtrW1emxGxsQI6AkLtS0tXPy4NcW1x+fj/LlzOW3GDBZMmsRlTz7JsrIyJmdn85DflLbKpibW7NnDzSdb03m/f9xxnH7ffWSnpvLPb3zDicNQQx4masiD50CgIz17tIYsnCHYwJ4BqGxqqpk9fuTe1T5t9Gje/M53Pvf66PR0nvvWt/pdJzcri8cvvnjf83MOO4xzDnN0OvDIfQMcRkMWg2cs0Jmama0hC2fIX1rmHeh6rcXKaSFb6utjYqbFCEd7yMNEDXkQ5BcUurBiyJ3JaRmZTuuJU9IYOFl9F9Z8ZE1WHx2oIQ8TNeTBMcr+byQhYUh1w5SQEihssRXI0GT1UUFM3C3pBGrIgyMdO0Yp4tJz5hyBip5uwupFa7J659EET8NEzWVw7DtPIqLnzDmCDexZyepbW2PiBpERTLfTAkYqai6D41NDdmkP2UGCGbKVrN7j0R6ys6ghDxM1l8GxL26sPWRHOWRpmTd9gLZ9yeo31NZWRVCT8nnUkIeJmsvg8AtZaA/ZQVzAnP4aSioqfMAuIGNtVVWtz+fzRVSZ4k+P0wJGKmoug0NDFtHDvABtG4DMzu7unsaOjhF9C/UIR3vIw0TNZXB8ep40ZOE08wO0bcW++7SmpaUyMnKUflBDHiZqLoPDf5aFTulxlkA95CrsmRa7mpo0juwcasjDRA15cOw7T77u7i4nhSgcHuAW6iqs90o27d2rPWTnUEMeJmrIg0PsP7wd7XoXmLOkA3n9Ndi3UFcC6Wv27KkyxkRUmLKPeqcFjFTUkAdHJ/adet7OtjYnBNw09mN1AAAZbklEQVR21gz+9PX53PGNo7jr4mMBaPPUc/93FvOHs/O5/zuLaW9qAGDd8me4/fwjuOeKk2ltrAOgbudmHr3p4gG3P8IIFEfeBGQ2dnR0NXd2qjE4wy6nBYxU1JAHxz4T7mpvc6yHfPU9/+UHj73H9//5DgClD/yO6QtP5oZny5m+8GRWPvA7AF5f9ie++9AbzC+8hA9ffAyAl+8u4vTvLnFKeqgJZMgbgRSAva2tGkd2hp1OCxipqCEPjlbskEVnW3PUhCzKS59nQeGlACwovJTylc8BIC4X3V2deDvaSEhMYusHbzBqbC5jp85wUm4oCWTI+wb29jQ3qyE7gxryMFFDHhxt2Oeqs7XZkZCFiPD3753JnRcdw6qn7wOgpa6GrHG5AGSNy6Wl3pp6e+q3f87fv3cWm95dzhFfuoAV9/2WU6/+qROyw0WgmRaV2O/Vlro6HdhzBjXkYaIVQwZBeWmxN7+gsANI6GxpcqSHfO0DK8kaN5GW+hru/85ixk3rd1wLgBnHnsaMY08D4P3nHybvhDOo3baB15f9kbSsAyi84Y8kpw10B/KIYMzSMu+Um+Ynfe6DX1JR0bo4L68BSF1bVaU9ZGdQQx4m2kMePE1AcntzgyM95KxxVpm4zNHjOezks9n58Woyx4ynqdbqBDbVVpI5+rN5wbva2/jg+X9w7Neu5aW7fs55RfcyafYC1pQ8GnH9YSBQ2GIzkLnL42lp93pbIiVIAaApu6ioyWkRIxU15MHjAZJaG+si3kPuam+ls7V53+ON77zCgdMPY/aJX+aD4mUAfFC8jPyCL39mvdce+gPHX/R9EpKS8Ha0IyKIuPB2OPKdEmoCGfIGrOlxOrAXeXSGxX6gIYv
[base64-encoded PNG image data omitted]\n",
      "text/plain": [
       "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# -----------------\n", + "# Visualize results\n", + "# -----------------\n", + "''' CHANGE TO STACKED BAR; SHOW WHAT ASSIGNMENTS CAME FROM WHERE '''\n", + "\n", + "# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/\n", + "totCount = len(logAfterGoldStandard)\n", + "unassigned = logAfterGoldStandard['SemanticGroup'].isnull().sum()\n", + "assigned = totCount - unassigned\n", + "labels = ['Assigned', 'Unassigned']\n", + "sizes = [assigned, unassigned]\n", + "colors = ['lightskyblue', 'lightcoral']\n", + "explode = (0.1, 0) # explode 1st slice\n", + "plt.pie(sizes, explode=explode, labels=labels, colors=colors,\n", + " autopct='%1.f%%', shadow=True, startangle=100)\n", + "plt.axis('equal')\n", + "plt.title(\"Status after 'GoldStandard' processing\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgo[remainder of base64-encoded PNG image data omitted]
A54DeknZOsaLUZwugVUQ8CJwPdF8vB2hmZrYR8MialddM0lSgMVAC/BP4Y1o3AugIvKRs3vIDoGx+cAJwKzA+Ipbn6ffXwE2SXiEbqbsiIu6RNAS4PU2hQnYN2+fAfZKako2+DavdQzQzM9t4KBswMdt49ezZMyZPLvhvCDEzM1uNpCkR0bOqdp4GNTMzMytgLtbMzMzMCpiLNTMzM7MC5mLNzMzMrIC5WDMzMzMrYC7WzMzMzAqYizUzMzOzAuZizczMzKyAuVgzMzMzK2B+3JRt9Oa9tYSzB71UZbubRu+ZtZ83j3HjxlFaWsoOO+zAsccey5133sn8+fNp1qwZgwcPpnnz5lX0ZmZmtmG4WLNNSklJCePGjWPo0KE0bdoUgJkzZ7J8+XIuvPBCnnvuOR599FGOOeaYKnoyMzPbMKqcBpW0taR/SZojaYqkSZKO3RDJ1SZJfSTdX0E8JJ2eE+uRYhdV0ecoSSfUcp5PSqryOWGVbL9L6mOqpNckDa/N/DZ2c+bMYbPNNmPkyJFcd911vPnmm8yePZtu3boBsPvuuzN79uw6ztLMzOwrlRZrkgSMA56KiB0jYi9gILDdhkhuA5oBDMh5PRCYVke51IikhuVC1wPXRkT3iPgf4IY6SKtgffrpp8yfP5/TTjuNIUOGMHr0aL788kuKiooAaNasGUuWLKnjLM3MzL5S1cjaIcDyiLi5LBARcyPiBsgKBUlXS3pR0nRJP0jxPml0Z6yk1yWNToUfkvaSNDGN0j0iadsU/5GkV1M/Y8onIqmjpKclvZR+elVjX4en2DPAcZUc5zygaRpFFHA48FDOvrtLei7ldq+kNnnyq+i4dpb0b0nTUt47lR/lk3SjpCF5+vyLpMmSZkq6Iif+tqRfpuM6sdxm2wLzc96vGVW8V0r7f1XSA5IeLBstTPvZMi33lPRkWm4uaWTq62VJ/VN8iKR7JD0sabak3+fkfHg6/mmSHquin66SXkijg9Mlda7kvauR5s2bs+OOO9KsWTNat25NixYtWLly5aoCbenSpasKNzMzs0JQVbHWFajsyu3TgU8jYm9gb+D7kjqldT2A84EuwI5Ab0mNyUZ6TkijdCOB36T2lwI9ImJ34Mw8+1oEHBoRe5KNgl2fsy7fvpoCfwOOAg4EtqniWMeSFT690jEvy1l3K3BJym0GcFnuhlUc12jgpojYI/W9oIo8cv0sInoCuwPfkLR7zrriiDggIsoXttcCj0t6SNIwSa1TvKL36lhgF6Ab8P2UY5V5AY+nvg4GrpZUdkV+d7L3pxswQNL2krYiey+OT+fhxCr6ORP4U0R0B3qSU3yWkTQ0FbKTlxZ/Uo2UMx07dmTRokWUlpZSXFzM559/Tvfu3Zk5cyaQXb/WuXOt1YZmZmbrrEY3GEi6CTiAbLRtb6AvsLu+um6rFdAZWA68EBHz03ZTgY7AYmA34NE0+NWQr4qX6cBoSePIpl7LawzcKKk7UAp8PWddvn19AbwVEbNT/DZgaCWHdydwB7ArcDupaJHUCmgdERNTu1uAu8ptu0u+45LUEmgfEfcCRERx6rOSNFbzHUlDyd6nbcmK0elp3R35NoiIf0h6hGx0sD/wA0l7UPF7dRBwe0SUAu9JerwaefUFjtZX1/Q1BTqk5cci4tN0nK8COwBtyKbS30o5flxFP5OAn0naDrin7D0sd5zDgeEA7dp2iWrkDEBRURF9+vThuuuuo7S0lGOOOYauXbsyc+ZMrrnmGpo2bcrgwYOr252Zmdl6V1WxNhM4vuxFRJydpsUmp5CAcyPikdyNJPVh9ZGp0rQvATMjYv88+zqCrHA4GviFpK4RUZKzfhiwENiDbESwOGddvn0BVPsf8Yh4X9IK4FDgPKo3wlQm73FJ2ryC9iWsPqrZdI0Os1Gvi4C9I+ITSaPKtfuyomQi4j2y0b2Rkl4hKyQreq++TcXnKTfP3H2LbJTsjXJ97UvF73u+feTtB3hN0vNkn4lHJJ0REdU
pIqtl3333Zd99910tNnDgwNrq3szMrFZVNQ36ONm1XGflxHIv6HkEOCtNAyLp6znTYfm8AWwlaf/UvnG6PqkBsH1EPAFcDLQGWpTbthWwICJWAqeQjV5V5nWgk6Sd0uuTqmgP8Euy6c7SskAaJfpE0oEpdAowsdx2eY8rIj4D5ks6JsU3k1QEzAW6pNetgG/myWVzsoLsU0lbA/2qkX/ZtWFl78c2QFvgXSp+r54CBqZr2rYlm44s8zawV1o+Pif+CHCutOrawB5VpDWJbBq3U2q/RWX9SNoRmBMR1wPjyaaBzczMNkmVjqxFRKRC41pJFwMfkBUQl6QmI8imHF9K/+B+AFT4BVURsTxNw12fipRGwHXALOC2FBPZ3YyLy23+Z+BuSScCT1DJyFLaV3GaQnxA0ofAM2QjTJVt82wFqwYDN6dCaw5wWjWPayZZcfdXSb8CVgAnRsQcSXeSTWnOBl7Ok8s0SS+nPuYA/6ks9xx9gT9JKht5/HEaNazovbqX7EaSGWTvQ24hegXwd0k/BZ7PiV+Zjm966utt4MiKEoqID9J7cU8qzBeRjWBW1M8A4LtppPN94FfVPHYzM7N6RxHVnim0TUCabr0/IsbWdS7V1a5tlzjx8NuqbFf2BAMzM7NCIGlKupGwUn6CgW30OnQqciFmZmb1los1W01EDKnrHMzMzOwrVT5uyszMzMzqjos1MzMzswLmYs3MzMysgLlYMzMzMytgLtbMzMzMCpiLNTMzM7MC5mLNzMzMrIC5WDMzMzMrYP5SXNvozXtrCWcPeqnC9blPN1i4cCFXXnkl559/Pq1bt+bWW28lPUeeIUOG0KZNm/Wer5mZWU14ZG0jJ6lU0lRJr0i6S1KRpI6SXqnr3ArRQw89ROfOnQGYOHEivXr1YtiwYey33348+eSTdZucmZlZHi7WNn5LI6J7ROwGLAfOrOuECtXbb7/N5ptvvmr07Gtf+xpLliwBYMmSJbRs2bIu0zMzM8vLxVr98jSwc1puKOlvkmZKmiCpGYCknSQ9LGmKpKcl7ZrioyRdL+lZSXMknZDiknR1GrmbIWlAiveRNFHSnZJmSbpK0iBJL6R2O6V2W0m6W9KL6ad3il8uaaSkJ9P+flR2EJLGpfxmShpaWyfnoYceom/fvqte77LLLjzzzDP8+te/5umnn6Z37961tSszM7Na42KtnpDUCOgHzEihzsBNEdEVWAwcn+LDgXMjYi/gIuDPOd1sCxwAHAlclWLHAd2BPYBvAVdL2jat2wM4D+gGnAJ8PSL2AUYA56Y2fwKujYi9Uw4jcva3K3AYsA9wmaTGKf69lF9P4EeS2q7VSckxY8YMdthhB1q0aLEqNm7cOI466ih+/vOfc8QRR3Dfffet627MzMxqnW8w2Pg1kzQ1LT8N/B34GvBWRJTFpwAdJbUAegF3lV1UD2yW09e4iFgJvCpp6xQ7ALg9IkqBhZImAnsDnwEvRsQCAEn/BSakbWYAB6flbwFdcva3uaSy+cYHImIZsEzSImBrYD5ZgXZsarM9WeH5Ue5BpxG3oQAtirap8iTNnz+fWbNmMWfOHN59913ef/99GjduvKp4a9my5aopUTMzs0LiYm3jtzQiuucGUmG0LCdUCjQjG0ldXL59jtxtVO53Ve1X5rxeyVefrQbA/hGxtBo5NpLUh6zA2z8ilkh6EmhafscRMZxslJB2bbtEJTkC0K9fP/r16wfArbfeSq9evSgqKuL222+nQYMGlJaWcvLJJ1fVjZmZ2QbnYm0TEhGfSXpL0okRcZeyimn3iJhWyWZPAT+QdAuwBXAQ8GOyKczqmACcA1wNIKl7zohfPq2AT1KhtiuwXzX3U22nnnrqquULL7ywtrs3MzOrVb5mbdMzCDhd0jRgJtC/ivb3AtOBacDjwMUR8X4N9vcjoKek6ZJepeq7VR8mG2GbDlwJPFeDfZmZmdU7iqhyBsmsoLVr2yVOPPy2CtfnfimumZlZoZA0JSJ6VtX
O06C20evQqcgFmZmZ1VueBjUzMzMrYC7WzMzMzAqYizUzMzOzAuZizczMzKyAuVgzMzMzK2Au1szMzMwKmIs1MzMzswLmYs3MzMysgLlYMzMzMytgfoKBbfTmvbWEswe9lHfdTaP3ZOnSpdx44400atSI5cuX079/f3bccUduueUWvvjiC4qKijjllFMoKirawJmbmZlVzcWa1XubbbYZF1xwAQ0bNuTDDz9kxIgR7LPPPnTo0IHDDjuMyZMn8+ijj9K/f1XPtDczM9vwPA1aSyRtLelfkuZImiJpkqRj6zqv8iQNkXRjBeselNS6hv3dJ2lS7WS3fjRo0ICGDRsCsHTpUtq3b8/ChQvZYYcdAOjYsSOzZs2qyxTNzMwq5JG1WiBJwDjglog4OcV2AI5ez/ttGBGltdVfRHy7hvtvDewJfCGpU0S8ladNo4goqa0c19bixYv5+9//zsKFCznllFP45JNPmDlzJrvuuiszZ85kyZIldZ2imZlZXh5Zqx2HAMsj4uayQETMjYgbICuqJF0t6UVJ0yX9IMWV4q9ImiFpQIo3kPRnSTMl3Z9GvE5I696W9EtJzwAnSvp+6neapLslFaV2oyTdLOlpSbMkHZmT79ckPSxptqTflwVT31um5VNTrtMk/bOC4z4e+D9gDDAwp59Rkv4o6QngfyU1lzQy5fmypP6pXceU30vpp1eKbyvpKUlT07k5cF3eHIDWrVtz4YUXcskll3DHHXfQq1cvSkpKuPbaa1m8eDGtWrVa112YmZmtFx5Zqx1dgfxXuGdOBz6NiL0lbQb8R9IEslGp7sAewJbAi5KeAnoDHYFuQDvgNWBkTn/FEXEAgKS2EfG3tPzrtK8bUruOwDeAnYAnJO2c4t2BHsAy4A1JN0TEO2WdS+oK/AzoHREfStqiguM6CbgCWAiMBX6Xs+7rwLciolTSb4HHI+J7aTTuBUn/BhYBh0ZEsaTOwO1AT+Bk4JGI+I2khsAaV/5LGgoMBWhRtE0F6WVWrFhB48aNAWjatClNmzalUaNGDBgwAIBnnnmG1q1rNPtrZma2wbhYWw8k3QQcQDbatjfQF9i9bHQMaAV0Tm1uT1OZCyVNBPZO8bsiYiXwfhqhynVHzvJuqUhrDbQAHslZd2fqY7akOcCuKf5YRHyacn0V2AF4J2e7Q4CxEfEhQER8nOcYtwZ2Bp6JiJBUImm3iHglNbkrZ4q2L3C0pIvS66ZAB+A94EZJ3YFSsgIP4EVgpKTGwLiImFp+/xExHBgO0K5tlyi/PteCBQsYO3Yskli5ciUnnHACCxYsYMyYMTRo0ID27dtz7LEFd3mhmZkZ4GKttswkmxIEICLOTtOJk1NIwLkRkVtIIamia8RUxf6+zFkeBRwTEdMkDQH65KwrX8SUvV6WEytlzc+B8mxb3gCgDfBWdskem5NNhf48T44Cjo+IN1bbiXQ52ajcHmRT8sUAEfGUpIOAI4B/Sro6Im6tIp8KdejQgQsuuGCN+LBhw9a2SzMzsw3G16zVjseBppLOyonlTt09ApyVRoqQ9HVJzYGngAHpmratgIOAF4BngOPTtWtbs3oBVl5LYEHqe1C5dSemPnYCdgTeWGPr/B4DviOpbco33zToScDhEdExIjoCe5Fz3Vo5jwDnphsxkNQjxVsBC9Lo3ylAw7R+B2BRmt79O9l0sZmZ2SbJI2u1IE0DHgNcK+li4AOykaVLUpMRZNePvZQKlg+AY4B7gf2BaWQjWRdHxPuS7ga+CbwCzAKeBz6tYPe/SOvnAjPIircybwATga2BM9O1YdU5npmSfgNMlFQKvAwMKVsvqSPZNOZzOdu8JekzSfvm6fJK4Dpgejr+t4EjgT8Dd0s6EXiCr0bj+gA/lrQC+AI4tcqkzczM6ilFVDXbZXVBUouI+CKNbr1AdrH/+zXYfhRwf0SMXV85FoqePXvG5MmTq25oZmZWQCRNiYieVbXzyFrhuj/dOdkEuLImhZqZmZnVHy7WClR
E9FnH7YfUTiZmZmZWl3yDgZmZmVkBc7FmZmZmVsBcrJmZmZkVMBdrZmZmZgXMxZqZmZlZAXOxZmZmZlbAXKyZmZmZFTB/z5rVW0uXLuXGG2+kUaNGLF++nP79+9O2bVtGjBjBokWLOPvss9l5553rOk0zM7NK+XFTttFr17ZLnHj4bavFbhq9JytXriQiaNiwIR9++CEjRozgggsuYMWKFdx999306tXLxZqZmdWZ6j5uytOgBUrSdpLukzRb0n8l/UlSk/W8z56Srl+L7TpKOnld+6ltDRo0oGHDhkA2yta+fXuaNGlC8+bN6zgzMzOz6nOxVoAkCbgHGBcRnYGvAy2A35RrV6vT2BExOSJ+tBabdgRWFWvr0E+tW7x4Mddccw033HAD3bt3r+t0zMzMaszFWmE6BCiOiH8AREQpMAz4nqQfSrpL0v8BEyQ1kPRnSTMl3S/pQUknAEj6paQXJb0iaXgqApH0pKT/lfSCpFmSDkzxPpLuT8sPSpqafj6VNDiNoD0t6aX00yvlexVwYGo7rFw/W0gaJ2m6pOck7Z7il0samXKZI+lHKd5c0gOSpqW8B6zLiWzdujUXXnghl1xyCXfccce6dGVmZlYnXKwVpq7AlNxARHwGzCO7KWR/YHBEHAIcRzay1Q04I60rc2NE7B0RuwHNgCNz1jWKiH2A84HLyicQEd+OiO7A6cBcYBywCDg0IvYEBgBlU52XAk9HRPeIuLZcV1cAL0fE7sBPgVtz1u0KHAbsA1wmqTFwOPBeROyR8n640jNViRUrVqxabtq0KU2bNl3brszMzOqM7wYtTALy3flRFn80Ij5OsQOAuyJiJfC+pCdy2h8s6WKgCNgCmAn8X1p3T/o9hazYW3Nn0pbAP4HvRMSnkloBN0rqDpSSTc9W5QDgeICIeFxS29QPwAMRsQxYJmkRsDUwA/iDpP8F7o+IpyvIbSgwFKBF0TZ5d7xgwQLGjh2LJFauXMkJJ5zA0qVLLh1onwAACUJJREFUGT58OO+//z7vvfceu+22G0ceeWTe7c3MzAqBi7XCNJNU4JSRtDmwPVmR9GXuqnwdSGoK/BnoGRHvSLocyB1aWpZ+l5LncyCpITAG+FVEvJLCw4CFwB5ko7LF1TiWfPmVFaLLcmKlZKN9syTtBXwb+J2kCRHxqzU6iBgODIfsbtB8O+7QoQMXXHDBGvHzzjuvGmmbmZkVBk+DFqbHgCJJp8KqwukaYBSwpFzbZ4Dj07VrWwN9UrysMPtQUgvghBrmcBUwPSLG5MRaAQvSKN4pQMMU/xxoWUE/TwGD0nH0AT5MU7p5SfoasCQibgP+AOxZw7zNzMzqFRdrBSiyL787FjhR0mxgFtko1k/zNL8bmA+8AvwVeB74NCIWA38jm1YcB7xYwzQuAvrm3GRwNNlI3WBJz5FNgZaN8E0HStJNAcPK9XM50FPSdLICcHAV++0GvCBpKvAz4Nc1zNvMzKxe8Zfi1gOSWkTEF5LaAi8AvSPi/brOa0Op6EtxzczMCll1vxTX16zVD/dLag00Aa7clAo1gA6dilycmZlZveVirR6IiD51nYOZmZmtH75mzczMzKyAuVgzMzMzK2Au1szMzMwKmO8GtY2epM+BN+o6j3pmS+DDuk6iHvH5rF0+n7XP57R2Vfd87hARW1XVyDcYWH3wRnVufbbqkzTZ57T2+HzWLp/P2udzWrtq+3x6GtTMzMysgLlYMzMzMytgLtasPhhe1wnUQz6ntcvns3b5fNY+n9PaVavn0zcYmJmZmRUwj6yZmZmZFTAXa7ZRk3S4pDckvSnp0rrOZ2Mh6W1JMyRNlTQ5xbaQ9Kik2el3mxSXpOvTOZ4uyQ9iBSSNlLRI0is5sRqfQ0mDU/vZkgbXxbEUggrO5+WS3k2f06mSvp2z7ifpfL4h6bCcuP8mAJK2l/SEpNckzZR0Xor7M7oWKjmfG+YzGhH+8c9G+QM0BP4L7Ej2EPtpQJe6zmtj+AHeBrYsF/s
9cGlavhT437T8beAhQMB+wPN1nX8h/AAHAXsCr6ztOQS2AOak323Scpu6PrYCOp+XAxfladsl/fe+GdAp/R1o6L8Jq52jbYE903JLYFY6b/6M1u753CCfUY+s2cZsH+DNiJgTEcuBMUD/Os5pY9YfuCUt3wIckxO/NTLPAa0lbVsXCRaSiHgK+LhcuKbn8DDg0Yj4OCI+AR4FDl//2ReeCs5nRfoDYyJiWUS8BbxJ9vfAfxOSiFgQES+l5c+B14D2+DO6Vio5nxWp1c+oizXbmLUH3sl5PZ/K/+OxrwQwQdIUSUNTbOuIWADZHyagXYr7PFdfTc+hz23VzknTciPLpuzw+awRSR2BHsDz+DO6zsqdT9gAn1EXa7YxU56Yb2+unt4RsSfQDzhb0kGVtPV5XncVnUOf28r9BdgJ6A4sAK5JcZ/PapLUArgbOD8iPqusaZ6Yz2k5ec7nBvmMulizjdl8YPuc19sB79VRLhuViHgv/V4E3Es2NL+wbHoz/V6Umvs8V19Nz6HPbSUiYmFElEbESuBvZJ9T8PmsFkmNyQqL0RFxTwr7M7qW8p3PDfUZdbFmG7MXgc6SOklqAgwExtdxTgVPUnNJLcuWgb7AK2TnruxOr8HAfWl5PHBqultsP+DTsmkUW0NNz+EjQF9JbdL0Sd8UM1YVE2WOJfucQnY+B0raTFInoDPwAv6bsIokAX8HXouIP+as8md0LVR0PjfUZ9QPcreNVkSUSDqH7A9HQ2BkRMys47Q2BlsD92Z/e2gE/CsiHpb0InCnpNOBecCJqf2DZHeKvQksAU7b8CkXHkm3A32ALSXNBy4DrqIG5zAiPpZ0JdkfcIBfRUR1L7KvVyo4n30kdSebJnob+AFARMyUdCfwKlACnB0Rpakf/03I9AZOAWZImppiP8Wf0bVV0fk8aUN8Rv0EAzMzM7MC5mlQMzMzswLmYs3MzMysgLlYMzMzMytgLtbMzMzMCpiLNTMzM7MC5mLNzGwDkdRRUkjqnxN7sxb6Xec+Kum7maTHJT0hqcP62k8F+35S0nYbcp9mhcjFmpnZhvU68JP0JZt1SlLDajTrDrwTEQdHxLw6zMNsk+Vizcxsw3oXeAnonxuUdLmk76blAySNSsujJP1Z0kNpdOs7kiZImiLpaznb/07SREm3SWpQLjZJ0pE5+xklaTzwnXI5fF/S8+nne6mI+gvZN9jfX65tH0kvpJz+kWLdJP07jcTdKalZij+SRslekLR/vjwkHSzpP6ndtTm7Oicd72OSNkvbnivp6XRcZ6TYwJx8frcub5BZofETDMzMNrzfAmMl3Vdly8zrEfFDSTcDvSOir6TzgQHAtWR/y8dHxE8k/Q04WlIx0CYiviGpCJgk6YHU37KIODp3B5K2As4B9k6hF4H/A84HvhsRZ5TL6Tjg5xExoaw4BG5KbedJOg84HbgROC4ivpT0P6nNIbl5pFHG14BvRMTCciNtz0XEpZKGA4dK+i9wOHAQ2YDD05LuBU5O+56Vk49ZveBizcxsA4uI+ZKmAMfkhnOWy0+Rvpx+zycbmStb3iNn2xfS8vPALsBK4BuSnkzxzYC2afnZPGntCMyIiOUAkmYAnSo5jKuBSyQNBh4ne25iV+DWNMPbFPh3Gl37k6RdgFKgfU4fZXlsBXwUEQsByh7Lk0xJv+el/JsBXYAnUnxzsgdj/wS4SNnzbu/kq2demm30XKyZmdWN3wFjc15/DJRdTL9XubZRwbJyfvckK9T2Bh4GlgETIuI8AElNImJ5KqRyi6EybwG7p4dLA3RLsa4V5P9RRJyTRsVmSbqL7CHWJ6UHgJP6OgIojYgDJXVh9YdWl+XxAbCFpK0i4gNJDSJiZQXH+xpZ8Xp8RISkxhGxQlJRRAxNU6WzcbFm9YiLNTOzOpBG1yaTTelBNho0XtKBZEVSTZQAx0v6PdnI2/iIKJW0fxpZC7KRuFMqyWeRpD8Dz6TQjalwqmiTCyT1JZuKfDQiPpN0NjB
KUuPU5nfAJLIbKv4N/KeCfUfadrykZWTF2LAK2r6S+pooqRRYKulo4GpJ3YDGwF8rStpsY+QHuZuZmZkVMF+EaWZmZlbAXKyZmZmZFTAXa2ZmZmYFzMWamZmZWQFzsWZmZmZWwFysmZmZmRUwF2tmZmZmBczFmpmZmVkB+3+uVoQVFjlEBwAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Bar of SemanticGroup categories, horizontal\n", + "# Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html\n", + "ax = logAfterGoldStandard['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),\n", + " color=\"slateblue\", fontsize=10);\n", + "ax.set_alpha(0.8)\n", + "ax.set_title(\"Categories assigned after 'GoldStandard' processing\", fontsize=14)\n", + "ax.set_xlabel(\"Number of searches\", fontsize=9);\n", + "# set individual bar lables using above list\n", + "for i in ax.patches:\n", + " # get_width pulls left or right; get_y pushes up or down\n", + " ax.text(i.get_width()+.1, i.get_y()+.31, \\\n", + " str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')\n", + "# invert for largest on top \n", + "ax.invert_yaxis()\n", + "plt.gcf().subplots_adjust(left=0.3)\n", + "\n", + "# Remove searchLogClean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 8. 
Create 'uniques' dataframe/file for APIs\n", + "\n", + "\n", + "OPTIONS IF YOU DON'T WANT TO RUN THE ENTIRE LOG\n", + "\n", + "Eyeball the df and select everything with 2 or more queries\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.iloc[0:11335]\n", + "\n", + "Or, remove rows up to a given index, such as 186, based on looking at the content\n", + "\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.iloc[186:]\n", + "\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.reset_index()\n", + "\n", + "If you think the count is too high, you could reduce the allowed character count\n", + "\n", + "mask = (listOfUniqueUnassignedAfterGS['adjustedQueryCase'].str.len() <= 15)\n", + "\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.loc[mask]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# Re-starting?\n", + "# logAfterGoldStandard = pd.read_excel(localDir + 'logAfterGoldStandard.xlsx')\n", + "\n", + "# Unique unassigned terms and frequency of occurrence\n", + "listOfUniqueUnassignedAfterGS = logAfterGoldStandard[pd.isnull(logAfterGoldStandard['preferredTerm'])] # was SemanticGroup\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.groupby('adjustedQueryCase').size()\n", + "listOfUniqueUnassignedAfterGS = pd.DataFrame({'timesSearched':listOfUniqueUnassignedAfterGS})\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.sort_values(by='timesSearched', ascending=False)\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.reset_index()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ---------------------------------------------------------------\n", + "# Eyeball for fixes - Don't give the API things it can't resolve\n", + "# ---------------------------------------------------------------\n", + "\n", + "'''\n", + "*** RUN SOME 
OF THIS EVERY TIME - COMMENTED OUT TO AVOID DAMAGE ****\n", + "\n", + "# Eyeball the data frame, sort by adjustedQueryCase; remove rows as appropriate\n", + "\n", + "# logAfterGoldStandard = logAfterGoldStandard.iloc[1595:] # remove before index...\n", + "\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^[0-9]{4}\") == False] # char entities\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^-\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^/\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^@\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^\\[\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^;\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^<\") == False] # leading punctuation\n", + "listToCheck4 = listToCheck4[listToCheck4.adjustedQueryCase.str.contains(\"^>\") == False] # leading punctuation\n", + "\n", + "listToCheck3.drop(58027, inplace=True)\n", + "'''" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "# Save to file so you can open in future sessions\n", + "writer = pd.ExcelWriter(localDir + 'listOfUniqueUnassignedAfterGS.xlsx')\n", + "listOfUniqueUnassignedAfterGS.to_excel(writer,'listOfUniqueUnassignedAfterGS')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/02_Run_APIs.ipynb b/02_Run_APIs.ipynb new file mode 100644 index 0000000..18f2c6b --- /dev/null +++ b/02_Run_APIs.ipynb @@ -0,0 +1,1077 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 2. Run APIs\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** Match query entries against UMLS REST API
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "\n", + "## Script contents\n", + "\n", + "Rather than re-using the same code during similar-and-optional runs, I duplicated and modified it for special cases. Could be re-factored.\n", + "\n", + "1. Start-up\n", + "2. UmlsApi1 - Normalized string matching\n", + "3. Isolate entries updated by API, complete tagging, and match to the \n", + "   current version of the search log - logAfterUmlsApi\n", + "4. Create logAfterUmlsApi as an update to logAfterGoldStandard by appending \n", + "   newUmlsWithSemanticGroupData\n", + "5. Update GoldStandard\n", + "6. Create new 'uniques' dataframe/file for fuzzy matching\n", + "\n", + "7. UmlsApi2 - Tag non-English terms in Roman character sets\n", + "\n", + "8. UmlsApi3 - Word matching (relaxed prediction rules)\n", + "\n", + "9. RxNorm API\n", + "\n", + "10. UmlsApi4 - Re-run the first configuration - Create logAfterUmlsApi4 as an \n", + "    update to logAfterUmlsApi by appending newUmlsWithSemanticGroupData\n", + "\n", + "11. Create updated training file (GoldStandard) for ML script\n", + "\n", + "Note: the Google Translate API, https://cloud.google.com/translate/, is not free; \n", + "see https://stackoverflow.com/questions/37667671/is-it-possible-to-access-to-google-translate-api-for-free\n", + "\n", + "\n", + "## FIXMEs\n", + "\n", + "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n", + "\n", + "* [ ] Improve/clarify processing flow\n", + "* [ ] Change SemanticNetworkReference.UniqueID to SemanticTypeCode\n", + "* [ ] Add SemanticNetworkReference.SemanticTypeCode to what goes into the logs, for ML.\n", + "\n", + "\n", + "## RESOURCES\n", + "\n", + "* Register at UMLS, get a UMLS-UTS API key, and add it below. 
This is the \n", + "primary source for Semantic Type classifications.\n", + "https://documentation.uts.nlm.nih.gov/rest/authentication.html\n", + "* UMLS quick start: \n", + "UMLS description of what Normalized String option is, \n", + "https://uts.nlm.nih.gov/doc/devGuide/webservices/metaops/find/find2.html\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: '01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0;31m# If you're starting a new session an this is not already open\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[0mlistOfUniqueUnassignedAfterGS\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m \u001b[0mlistOfUniqueUnassignedAfterGS\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_excel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlistOfUniqueUnassignedAfterGS\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 24\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 25\u001b[0m \u001b[0;31m# Bring in historical file of (somewhat edited) matches\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 176\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mnew_arg_name\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnew_arg_value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 178\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 179\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 180\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0m_deprecate_kwarg\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 176\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 177\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mnew_arg_name\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnew_arg_value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 178\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 179\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 180\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0m_deprecate_kwarg\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py\u001b[0m in \u001b[0;36mread_excel\u001b[0;34m(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, 
true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, **kwds)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mio\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mExcelFile\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 307\u001b[0;31m \u001b[0mio\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mExcelFile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mio\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 308\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 309\u001b[0m return io.parse(\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, io, **kwds)\u001b[0m\n\u001b[1;32m 392\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbook\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxlrd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mopen_workbook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile_contents\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 393\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_io\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcompat\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstring_types\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 394\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbook\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mxlrd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mopen_workbook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_io\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 395\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 396\u001b[0m raise ValueError('Must explicitly set engine if not passing in'\n", + "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/xlrd/__init__.py\u001b[0m in \u001b[0;36mopen_workbook\u001b[0;34m(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)\u001b[0m\n\u001b[1;32m 114\u001b[0m \u001b[0mpeek\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfile_contents\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mpeeksz\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 115\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 116\u001b[0;31m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"rb\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 117\u001b[0m \u001b[0mpeek\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpeeksz\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mpeek\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34mb\"PK\\x03\\x04\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;31m# a ZIP file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx'" + ] + } + ], + "source": [ + "# 1. 
Start-up / What to put into place, where\n", + "# ============================================\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import requests\n", + "import json\n", + "import lxml.html as lh\n", + "from lxml.html import fromstring\n", + "import time\n", + "import os\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/webDS')\n", + "\n", + "\n", + "localDir = '02_Run_APIs_files/'\n", + "\n", + "# If you're starting a new session and this is not already open\n", + "listOfUniqueUnassignedAfterGS = '01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx'\n", + "listOfUniqueUnassignedAfterGS = pd.read_excel(listOfUniqueUnassignedAfterGS)\n", + "\n", + "# Bring in historical file of (somewhat edited) matches\n", + "GoldStandard = '01_Pre-processing_files/GoldStandard_master.xlsx'\n", + "GoldStandard = pd.read_excel(GoldStandard)\n", + "\n", + "\n", + "\n", + "# Get API key\n", + "def get_umls_api_key(filename=None):\n", + "    key = os.environ.get('UMLS_API_KEY', None)\n", + "    if key is not None:\n", + "        return key\n", + "    if filename is None:\n", + "        path = os.environ.get('HOME', None)\n", + "        if path is None:\n", + "            path = os.environ.get('USERPROFILE', None)\n", + "        if path is None:\n", + "            path = '.'\n", + "        filename = os.path.join(path, '.umls_api_key')\n", + "    with open(filename, 'r') as f:\n", + "        key = f.readline().strip()\n", + "    return key\n", + "\n", + "myUTSAPIkey = get_umls_api_key()\n", + "\n", + "\n", + "'''\n", + "GoldStandard.xlsx - Already-assigned term list, from UMLS and other sources, \n", + "    vetted.\n", + "'''\n", + "\n", + "\n", + "'''\n", + "SemanticNetworkReference - Customized version of the list at \n", + "https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html, \n", + "to be used to put search terms into huge bins. 
Should be integrated into \n", + "GoldStandard and be available at the end of the ML matching process.\n", + "'''\n", + "SemanticNetworkReference = '01_Pre-processing_files/SemanticNetworkReference.xlsx'\n", + "\n", + "\n", + "''' \n", + "- Run what remains against the UMLS API.\n", + "\n", + "Requires having your own license and API key; see https://www.nlm.nih.gov/research/umls/\n", + "Not shown here: \n", + "    - In huge files I sort by count and focus on terms searched by multiple\n", + "      or many people. The 'long tail' can be huge.\n", + "    - I have a database of terms already assigned. I match these before \n", + "      contacting UMLS; no need to check them again. Shortens processing time.\n", + "More options:\n", + "    https://documentation.uts.nlm.nih.gov/rest/home.html\n", + "    https://documentation.uts.nlm.nih.gov/rest/concept/\n", + "\n", + "'''\n", + "\n", + "\n", + "\n", + "# unassignedAfterUmls1 = pd.read_excel(localDir + 'unassignedAfterUmls1.xlsx')\n", + "\n", + "'''\n", + "Register at RxNorm, get an API key, and add it below. This is for drug misspellings.\n", + "'''\n", + "\n", + "# Generate a one-day Ticket-Granting-Ticket (TGT)\n", + "tgt = requests.post('https://utslogin.nlm.nih.gov/cas/v1/api-key', data = {'apikey':myUTSAPIkey})\n", + "# For API key get a license from https://www.nlm.nih.gov/research/umls/\n", + "# tgt.text\n", + "response = fromstring(tgt.text)\n", + "todaysTgt = response.xpath('//form/@action')[0]\n", + "\n", + "uiUri = \"https://uts-ws.nlm.nih.gov/rest/search/current?\"\n", + "semUri = \"https://uts-ws.nlm.nih.gov/rest/content/current/CUI/\"\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 2. UmlsApi1 - Normalized string matching\n", + "# =========================================\n", + "'''\n", + "In this run the API calls use the Normalized String setting. 
Example: \n", + "for the input string Yellow leaves, normalizedString would return two strings, \n", + "leaf yellow and leave yellow. Each string would be matched exactly to the \n", + "strings in the normalized string index to return a result. \n", + "\n", + "Re-start:\n", + "# listOfUniqueUnassignedAfterGS = pd.read_excel('01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx')\n", + "\n", + "listToCheck6 = pd.read_excel(localDir + 'listToCheck6.xlsx')\n", + "listToCheck7 = pd.read_excel(localDir + 'listToCheck7.xlsx')\n", + "'''\n", + "\n", + "# ---------------------------------------\n", + "# Batch rows so you can do separate runs\n", + "# Batches of up to 6,000 rows per run\n", + "# ---------------------------------------\n", + "\n", + "# uniqueSearchTerms = search['adjustedQueryCase'].unique()\n", + "\n", + "# Reduce entry length, to focus on single concepts that the UTS API can match\n", + "listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.loc[(listOfUniqueUnassignedAfterGS['adjustedQueryCase'].str.len() <= 20) == True]\n", + "\n", + "\n", + "# .iloc slice ends are exclusive, so consecutive batches share a boundary (no rows skipped)\n", + "# listToCheck1 = unassignedAfterGS.iloc[0:20]\n", + "listToCheck1 = listOfUniqueUnassignedAfterGS.iloc[0:6000]\n", + "listToCheck2 = listOfUniqueUnassignedAfterGS.iloc[6000:12000]\n", + "listToCheck3 = listOfUniqueUnassignedAfterGS.iloc[12000:18000]\n", + "listToCheck4 = listOfUniqueUnassignedAfterGS.iloc[18000:24000]\n", + "listToCheck5 = listOfUniqueUnassignedAfterGS.iloc[24000:30000]\n", + "listToCheck6 = listOfUniqueUnassignedAfterGS.iloc[30000:36000]\n", + "listToCheck7 = listOfUniqueUnassignedAfterGS.iloc[36000:39523]\n", + "\n", + "\n", + "\n", + "'''\n", + "listToCheck1 = unassignedToCheck.iloc[12497:20000]\n", + "listToCheck2 = unassignedToCheck.iloc[20001:26000]\n", + "listToCheck3 = unassignedToCheck.iloc[23225:28000]\n", + "listToCheck4 = unassignedToCheck.iloc[28001:31256]\n", + "\n", + "mask = (unassignedToCheck['adjustedQueryCase'].str.len() <= 15)\n", + "listToCheck3 = listToCheck3.loc[mask]\n", + 
"listToCheck4 = listToCheck4.loc[mask]\n", + "'''\n", + "\n", + "\n", + "# If multiple sessions required, saving to file might help\n", + "writer = pd.ExcelWriter(localDir + 'listToCheck7.xlsx')\n", + "listToCheck7.to_excel(writer,'listToCheck7')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'listToCheck2.xlsx')\n", + "listToCheck2.to_excel(writer,'listToCheck2')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "'''\n", + "OPTIONS\n", + "\n", + "# Bring in from file\n", + "listToCheck3 = pd.read_excel(localDir + 'listToCheck3.xlsx')\n", + "listToCheck4 = pd.read_excel(localDir + 'listToCheck4.xlsx')\n", + "\n", + "listToCheck1 = unassignedAfterGS\n", + "listToCheck2 = unassignedAfterGS.iloc[5001:10000]\n", + "listToCheck1 = unassignedAfterGS.iloc[10001:11335]\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run this block after changing listToCheck# top and bottom\n", + "# ----------------------------------------------------------\n", + "'''\n", + "Until you put this into a function, you need to change listToCheck# \n", + "and apiGetNormalizedString# counts every run!\n", + "Stay below 30 API requests per second. 
With 4 API requests per item\n", + "(2 .get and 2 .post requests)...\n", + "time.sleep commented out: 6,000 / 35 min = 171 per minute = 2.9 items per second / 11.4 requests per second\n", + "Computing differently, 6,000 items @ 4 Req per item = 24,000 Req, divided by 35 min+\n", + "686 Req/min = 11.4 Req/sec\n", + "time.sleep(.07): ~38 minutes to do 6,000; 158 per minute / 2.6 items per second\n", + "'''\n", + "\n", + "apiGetNormalizedString = pd.DataFrame()\n", + "apiGetNormalizedString['adjustedQueryCase'] = \"\"\n", + "apiGetNormalizedString['preferredTerm'] = \"\"\n", + "apiGetNormalizedString['SemanticTypeName'] = \"\"\n", + "\n", + "'''\n", + "For file 6, 7/5/18 1:05 p.m.: SSLError: HTTPSConnectionPool(host='utslogin.nlm.nih.gov', \n", + "port=443): Max retries exceeded with url: \n", + "/cas/v1/api-key/TGT-480224-qLwYAMKl5cTfa7Jwb7RWZ3kfexPUm479HfddD7yVUKt79lZ0Ta-cas \n", + "(Caused by SSLError(SSLError(\"bad handshake: SysCallError(60, 'ETIMEDOUT')\",),))\n", + "\n", + "Later, run 6 and 7\n", + "'''\n", + "\n", + "\n", + "for index, row in listToCheck7.iterrows():\n", + " currLogTerm = row['adjustedQueryCase']\n", + " # === Get 'preferred term' and its concept identifier (CUI/UI) =========\n", + " stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST)\n", + " # Example: GET https://uts-ws.nlm.nih.gov/rest/search/current?string=tylenol&sabs=MSH&ticket=ST-681163-bDfgQz5vKe2DJXvI4Snm-cas\n", + " tQuery = {'string':currLogTerm, 'searchType':'normalizedString', 'ticket':stTicket.text} # removed 'sabs':'MSH', \n", + " getPrefTerm = requests.get(uiUri, params=tQuery)\n", + " getPrefTerm.encoding = 'utf-8'\n", + " tItems = json.loads(getPrefTerm.text)\n", + " tJson = tItems[\"result\"]\n", + " if tJson[\"results\"][0][\"ui\"] != \"NONE\": # Sub-loop to resolve \"NONE\"\n", + " currUi = tJson[\"results\"][0][\"ui\"]\n", + " currPrefTerm = tJson[\"results\"][0][\"name\"]\n", + " # === Get 'semantic 
type' =========\n", + " stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST)\n", + " # Example: GET https://uts-ws.nlm.nih.gov/rest/content/current/CUI/C0699142?ticket=ST-512564-vUxzyI00ErMRm6tjefNP-cas\n", + " semQuery = {'ticket':stTicket.text}\n", + " getPrefTerm = requests.get(semUri+currUi, params=semQuery)\n", + " getPrefTerm.encoding = 'utf-8'\n", + " semItems = json.loads(getPrefTerm.text)\n", + " semJson = semItems[\"result\"]\n", + " currSemTypes = []\n", + " for name in semJson[\"semanticTypes\"]:\n", + " currSemTypes.append(name[\"name\"]) # + \" ; \"\n", + " # === Post to dataframe =========\n", + " apiGetNormalizedString = apiGetNormalizedString.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, \n", + " 'preferredTerm': currPrefTerm, \n", + " 'SemanticTypeName': currSemTypes[0]}, index=[0]), ignore_index=True)\n", + " print('{} --> {}'.format(currLogTerm, currSemTypes[0])) # Write progress to console\n", + " # time.sleep(.06)\n", + " else:\n", + " # Post \"NONE\" to database and restart loop\n", + " apiGetNormalizedString = apiGetNormalizedString.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, 'preferredTerm': \"NONE\"}, index=[0]), ignore_index=True)\n", + " print('{} --> NONE'.format(currLogTerm, )) # Write progress to console\n", + " # time.sleep(.06)\n", + "print (\"* Done *\")\n", + "\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'apiGetNormalizedString7.xlsx')\n", + "apiGetNormalizedString.to_excel(writer,'apiGetNormalizedString')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "# Free up memory: Remove listToCheck, listToCheck1, listToCheck2, listToCheck3, \n", + "# listToCheck4, nonForeign, searchLog, unassignedAfterGS\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 3. 
Isolate entries updated by API, complete tagging, and match to \n", + "# the current version of the search log - logAfterUmlsApi\n", + "# ==================================================================\n", + "'''\n", + "To Do:\n", + "\n", + " Isolate new assignments and:\n", + " - merge them into the master version of the log\n", + " - add to GoldStandard for next time\n", + " \n", + " # Move unassigned entries into workflow for human identification\n", + " \n", + "To re-start\n", + "\n", + "unassignedAfterGS = pd.read_excel(localDir + 'unassignedAfterGS.xlsx')\n", + "logAfterGoldStandard = pd.read_excel(localDir + 'logAfterGoldStandard.xlsx')\n", + "\n", + "listFromApi = pd.read_excel('02_UMLS_API_files/listFromApi1-April-May.xlsx')\n", + "assignedByUmlsApi = pd.read_excel(localDir + 'assignedByUmlsApi.xlsx')\n", + "\n", + "# Fix temporary issue of nulls in SemanticTypeName, and wrong col name semTypeName\n", + " \n", + "listFromApi.drop(['SemanticTypeName'], axis=1, inplace=True)\n", + "listFromApi.rename(columns={'semTypeName': 'SemanticTypeName'}, inplace=True)\n", + "\n", + "# listFromApi = listFromApi.dropna(subset=['SemanticTypeName'])\n", + " '''\n", + " \n", + "\n", + "# If you stored output from UMLS API in files, re-open and unite\n", + "newAssignments1 = pd.read_excel(localDir + 'apiGetNormalizedString1.xlsx')\n", + "newAssignments2 = pd.read_excel(localDir + 'apiGetNormalizedString2.xlsx')\n", + "newAssignments3 = pd.read_excel(localDir + 'apiGetNormalizedString3.xlsx')\n", + "newAssignments4 = pd.read_excel(localDir + 'apiGetNormalizedString4.xlsx')\n", + "newAssignments5 = pd.read_excel(localDir + 'apiGetNormalizedString5.xlsx')\n", + "newAssignments6 = pd.read_excel(localDir + 'apiGetNormalizedString6.xlsx')\n", + "newAssignments7 = pd.read_excel(localDir + 'apiGetNormalizedString7.xlsx')\n", + "\n", + "\n", + "# Put dataframes together into one; df = df1.append([df2, df3])\n", + "afterUmlsApi1 = newAssignments1.append([newAssignments2, 
newAssignments3, newAssignments4, newAssignments5])\n",
+    "afterUmlsApi1 = afterUmlsApi1.append([newAssignments6, newAssignments7]) # extend the combined frame; don't overwrite it\n",
+    "\n",
+    "\n",
+    "'''\n",
+    "afterUmlsApi1 = afterUmlsApi1.append(newAssignments3)\n",
+    "afterUmlsApi1 = afterUmlsApi1.append(newAssignments4)\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "# If you only used one df for listFromApi\n",
+    "# afterUMLSapi = listFromApi\n",
+    "# assignedByUmlsApi = listFromApi\n",
+    "\n",
+    "\n",
+    "# Reduce to a version that has only successful assignments\n",
+    "\n",
+    "# Remove various problem entries\n",
+    "assignedByUmlsApi1 = afterUmlsApi1.loc[(afterUmlsApi1['preferredTerm'] != \"NONE\")]\n",
+    "assignedByUmlsApi1 = assignedByUmlsApi1[~pd.isnull(assignedByUmlsApi1['preferredTerm'])]\n",
+    "assignedByUmlsApi1 = assignedByUmlsApi1.loc[(assignedByUmlsApi1['preferredTerm'] != \"Null Value\")]\n",
+    "assignedByUmlsApi1 = assignedByUmlsApi1[~pd.isnull(assignedByUmlsApi1['adjustedQueryCase'])]\n",
+    "\n",
+    "\n",
+    "# If you want to send to Excel\n",
+    "writer = pd.ExcelWriter(localDir + 'assignedByUmlsApi1.xlsx')\n",
+    "assignedByUmlsApi1.to_excel(writer,'assignedByUmlsApi1')\n",
+    "# df2.to_excel(writer,'Sheet2')\n",
+    "writer.save()\n",
+    "\n",
+    "\n",
+    "# Bring in subject category master file\n",
+    "# SemanticNetworkReference = pd.read_excel(localDir + 'SemanticNetworkReference.xlsx')\n",
+    "SemanticNetworkReference = pd.read_excel(SemanticNetworkReference)\n",
+    "\n",
+    "# Reduce to required cols\n",
+    "SemTypeData = SemanticNetworkReference[['SemanticTypeName', 'SemanticGroupCode', 'SemanticGroup', 'CustomTreeNumber', 'BranchPosition']]\n",
+    "# SemTypeData.rename(columns={'SemanticTypeName': 'semTypeName'}, inplace=True) # The join col\n",
+    "\n",
+    "# Add more semantic tagging to new UMLS API adds\n",
+    "newUmlsWithSemanticGroupData = pd.merge(assignedByUmlsApi1, SemTypeData, how='left', on='SemanticTypeName')\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+
"source": [
+    "# 4. Create logAfterUmlsApi as an update to logAfterGoldStandard by appending \n",
+    "# newUmlsWithSemanticGroupData\n",
+    "# ============================================================================\n",
+    "\n",
+    "'''\n",
+    "Depending on what you're processing, use either this section or the next one.\n",
+    "\n",
+    "The right choice depends on how you processed the earlier batches - for example, \n",
+    "whether you sent terms all the way down to one occurrence to the API in the \n",
+    "first batch, or not.\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "logAfterGoldStandard = '01_Pre-processing_files/logAfterGoldStandard.xlsx'\n",
+    "logAfterGoldStandard = pd.read_excel(logAfterGoldStandard)\n",
+    "\n",
+    "\n",
+    "'''\n",
+    "# FIXME - Remove after this is fixed within the fixme above.\n",
+    "logAfterGoldStandard = logAfterGoldStandard.sort_values(by='adjustedQueryCase', ascending=True)\n",
+    "logAfterGoldStandard = logAfterGoldStandard.reset_index()\n",
+    "logAfterGoldStandard.drop(['index'], axis=1, inplace=True)\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "# Eyeball. If you need to remove rows...\n",
+    "# logAfterGoldStandard = logAfterGoldStandard.iloc[760:] # remove before index...\n",
+    "\n",
+    "# Join new UMLS API adds to the current search log master\n",
+    "logAfterUmlsApi1 = pd.merge(logAfterGoldStandard, newUmlsWithSemanticGroupData, how='left', on='adjustedQueryCase')\n",
+    "\n",
+    "logAfterUmlsApi1.columns\n",
+    "\n",
+    "'''\n",
+    "['SessionID', 'StaffYN', 'Referrer', 'Query', 'Timestamp',\n",
+    " 'adjustedQueryCase', 'SemanticTypeName_x', 'SemanticGroup_x',\n",
+    " 'SemanticGroupCode_x', 'BranchPosition_x', 'CustomTreeNumber_x',\n",
+    " 'ResourceType', 'Address', 'EntrySource', 'contentSteward',\n",
+    " 'preferredTerm_x', 'SemanticTypeName_y', 'preferredTerm_y',\n",
+    " 'SemanticGroupCode_y', 'SemanticGroup_y', 'CustomTreeNumber_y',\n",
+    " 'BranchPosition_y']\n",
+    "\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. 
Temporary fix:\n", + "logAfterUmlsApi1['preferredTerm2'] = logAfterUmlsApi1['preferredTerm_x'].where(logAfterUmlsApi1['preferredTerm_x'].notnull(), logAfterUmlsApi1['preferredTerm_y'])\n", + "logAfterUmlsApi1['SemanticTypeName2'] = logAfterUmlsApi1['SemanticTypeName_x'].where(logAfterUmlsApi1['SemanticTypeName_x'].notnull(), logAfterUmlsApi1['SemanticTypeName_y'])\n", + "logAfterUmlsApi1['SemanticGroup2'] = logAfterUmlsApi1['SemanticGroup_x'].where(logAfterUmlsApi1['SemanticGroup_x'].notnull(), logAfterUmlsApi1['SemanticGroup_y'])\n", + "logAfterUmlsApi1['SemanticGroupCode2'] = logAfterUmlsApi1['SemanticGroupCode_x'].where(logAfterUmlsApi1['SemanticGroupCode_x'].notnull(), logAfterUmlsApi1['SemanticGroupCode_y'])\n", + "logAfterUmlsApi1['BranchPosition2'] = logAfterUmlsApi1['BranchPosition_x'].where(logAfterUmlsApi1['BranchPosition_x'].notnull(), logAfterUmlsApi1['BranchPosition_y'])\n", + "logAfterUmlsApi1['CustomTreeNumber2'] = logAfterUmlsApi1['CustomTreeNumber_x'].where(logAfterUmlsApi1['CustomTreeNumber_x'].notnull(), logAfterUmlsApi1['CustomTreeNumber_y'])\n", + "logAfterUmlsApi1.drop(['preferredTerm_x', 'preferredTerm_y',\n", + " 'SemanticTypeName_x', 'SemanticTypeName_y',\n", + " 'SemanticGroup_x', 'SemanticGroup_y',\n", + " 'SemanticGroupCode_x', 'SemanticGroupCode_y',\n", + " 'BranchPosition_x', 'BranchPosition_y', \n", + " 'CustomTreeNumber_x', 'CustomTreeNumber_y'], axis=1, inplace=True)\n", + "logAfterUmlsApi1.rename(columns={'preferredTerm2': 'preferredTerm',\n", + " 'SemanticTypeName2': 'SemanticTypeName',\n", + " 'SemanticGroup2': 'SemanticGroup',\n", + " 'SemanticGroupCode2': 'SemanticGroupCode',\n", + " 'BranchPosition2': 'BranchPosition',\n", + " 'CustomTreeNumber2': 'CustomTreeNumber'\n", + " }, inplace=True)\n", + "\n", + "# Save to file so you can open in future sessions, if needed\n", + "writer = pd.ExcelWriter(localDir + 'logAfterUmlsApi1.xlsx')\n", + "logAfterUmlsApi1.to_excel(writer,'logAfterUmlsApi1')\n", + "# 
df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "'''\n", + "To Do:\n", + " - Create list of unmatched terms with freq\n", + " - Cluster similar spellings together?\n", + " \n", + "- Look at \"Not currently matchable\" terms with \"high\" frequency counts. Eyeball to see if these were incorrectly matched in the past; assign historical term or update all to new term, save in gold standard file.\n", + "- Process entries from the PubMed product page.\n", + "- If you haven't done so, update RegEx list to improve future matching.\n", + "- Every several months, through Flask interface, interactively update the gold standard, manually.\n", + "\n", + "# Reduce logAfterUmlsApi to unique, unmatched entries, prep for ML\n", + "\n", + "To re-start:\n", + "logAfterUmlsApi = pd.read_excel(localDir + 'logAfterUmlsApi.xlsx')\n", + "'''\n", + "\n", + "\n", + "# ------------------------------------\n", + "# Visualize results - logAfterUmlsApi\n", + "# ------------------------------------\n", + " \n", + "# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/\n", + "totCount = len(logAfterUmlsApi1)\n", + "unassigned = logAfterUmlsApi1['SemanticGroup'].isnull().sum()\n", + "assigned = totCount - unassigned\n", + "labels = ['Assigned', 'Unassigned']\n", + "sizes = [assigned, unassigned]\n", + "colors = ['lightskyblue', 'lightcoral']\n", + "explode = (0.1, 0) # explode 1st slice\n", + "plt.pie(sizes, explode=explode, labels=labels, colors=colors,\n", + " autopct='%1.f%%', shadow=True, startangle=100)\n", + "plt.axis('equal')\n", + "plt.title(\"Status after 'UMLS API' processing\")\n", + "plt.show()\n", + "\n", + "# Bar of SemanticGroup categories, horizontal\n", + "# Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html\n", + "ax = logAfterUmlsApi1['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),\n", + " color=\"slateblue\", fontsize=10);\n", + "ax.set_alpha(0.8)\n", + "ax.set_title(\"Categories 
assigned after 'UMLS API' processing\", fontsize=14)\n", + "ax.set_xlabel(\"Number of searches\", fontsize=9);\n", + "# set individual bar lables using above list\n", + "for i in ax.patches:\n", + " # get_width pulls left or right; get_y pushes up or down\n", + " ax.text(i.get_width()+.1, i.get_y()+.31, \\\n", + " str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')\n", + "# invert for largest on top \n", + "ax.invert_yaxis()\n", + "plt.gcf().subplots_adjust(left=0.3)\n", + "\n", + "# Remove listOfUniqueUnassignedAfterGS, listToCheck1, etc., logAfterGoldStandard, logAfterUmlsApi1, \n", + "# newAssignments1 etc.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 5. Update GoldStandard\n", + "# =======================\n", + "\n", + "# Open GoldStandard if needed\n", + "GoldStandard = '01_Pre-processing_files/GoldStandard.xlsx'\n", + "GoldStandard = pd.read_excel(GoldStandard)\n", + "\n", + "# Append fully tagged UMLS API adds to GoldStandard\n", + "GoldStandard = GoldStandard.append(newUmlsWithSemanticGroupData, sort=False)\n", + "\n", + "# Reset index\n", + "GoldStandard = GoldStandard.reset_index()\n", + "GoldStandard.drop(['index'], axis=1, inplace=True)\n", + "# temp GoldStandard.drop(['adjustedQueryCase'], axis=1, inplace=True)\n", + "\n", + "'''\n", + "Eyeball top and bottom of cols, remove rows by Index, if needed\n", + "\n", + "GoldStandard.drop(58027, inplace=True)\n", + "'''\n", + "\n", + "\n", + "# Write out the updated GoldStandard\n", + "writer = pd.ExcelWriter('01_Pre-processing_files/GoldStandard.xlsx')\n", + "GoldStandard.to_excel(writer,'GoldStandard')\n", + "writer.save()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 6. 
Start new 'uniques' dataframe that gets new column for each of the below\n", + "# listOfUniqueUnassignedAfterUmls1\n", + "# ============================================================================\n", + "\n", + "'''\n", + "To Do:\n", + " - Create list of unmatched terms with freq\n", + " - Cluster similar spellings together?\n", + " \n", + "- Look at \"Not currently matchable\" terms with \"high\" frequency counts. Eyeball to see if these were incorrectly matched in the past; assign historical term or update all to new term, save in gold standard file.\n", + "- Process entries from the PubMed product page.\n", + "- If you haven't done so, update RegEx list to improve future matching.\n", + "- Every several months, through Flask interface, interactively update the gold standard, manually.\n", + "\n", + "# Reduce logAfterUmlsApi to unique, unmatched entries, prep for ML\n", + "\n", + "To re-start:\n", + "logAfterUmlsApi = pd.read_excel(localDir + 'logAfterUmlsApi.xlsx')\n", + "'''\n", + "\n", + "listOfUniqueUnassignedAfterUmls1 = logAfterUmlsApi1[pd.isnull(logAfterUmlsApi1['SemanticGroup'])]\n", + "listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.groupby('adjustedQueryCase').size()\n", + "listOfUniqueUnassignedAfterUmls1 = pd.DataFrame({'timesSearched':listOfUniqueUnassignedAfterUmls1})\n", + "listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.sort_values(by='timesSearched', ascending=False)\n", + "listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.reset_index()\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'listOfUniqueUnassignedAfterUmls11.xlsx')\n", + "listOfUniqueUnassignedAfterUmls1.to_excel(writer,'unassignedToCheck')\n", + "writer.save()\n", + "\n", + "# FY 18 Q3: 57,287\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 5. 
Google Translate API, https://cloud.google.com/translate/\n", + "# =============================================================\n", + "'''\n", + "But it's not free; https://stackoverflow.com/questions/37667671/is-it-possible-to-access-to-google-translate-api-for-free\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 5. UmlsApi2 - Tag non-English terms in Roman character sets\n", + "# ==========================================================================\n", + "'''\n", + "Some foreign terms can be matched. This run does not return a preferred term,\n", + "just returns what vocabulary the term is found in. \n", + "\n", + "Queries with words not in English are ignored by the first API run using\n", + "\"normalized string\" matching. Here, try flagging what you can and take them \n", + "out of the percent-complete calculation.\n", + "\n", + "The API apparently only supports U.S. English. RegEx could be used to convert\n", + "UTF-8 Roman characters that are not English... Non-Roman languages (Chinese, \n", + "Cyrillic, Arabic, Japanese, etc.) are not supported by the API; these should \n", + "be kept out of the API runs entirely.\n", + "\n", + "6/22/18, from David of UMLS support, TRACKING:000308010\n", + "\n", + "> Can the UMLS REST API tell me the term's language? \n", + "\n", + "One option would be to specify returnIdType=sourceUi for your search \n", + "request. For example: \n", + " \n", + "https://uts-ws.nlm.nih.gov/rest/search/current?string=Infarto de miocardio&returnIdType=sourceUi&ticket=\n", + "\n", + "This will give you a set of codes back where there is a match, but will \n", + "also return a vocabulary (rootSource). If you have that, you can get \n", + "the language (in this case, Spanish). The first result may be all you \n", + "need. 
If you have the rootSource, you can match it to the \"abbreviation\" \n", + "and look up the language here: https://uts-ws.nlm.nih.gov/rest/metadata/current/sources. \n", + " \n", + "It won't be perfect. I'm seeing some problems with accented characters. \n", + "For example, coração returns no results, so that's not great, but may \n", + "not matter. Some strings will appear in multiple languages, too. \n", + " \n", + "Let me know how that works for you. - David\n", + "'''\n", + "\n", + "\n", + "# ------------------------------------------------------\n", + "# Batch up your API runs. Re-starting, correcting, etc.\n", + "# ------------------------------------------------------\n", + "\n", + "# uniqueSearchTerms = search['adjustedQueryCase'].unique()\n", + "\n", + "# vocabCheck1 = unassignedAfterGS.iloc[0:20]\n", + "vocabCheck1 = listOfUniqueUnassignedAfterUmls1.iloc[0:5000]\n", + "# vocabCheck2 = listOfUniqueUnassignedAfterUmls1.iloc[5001:10678]\n", + "\n", + "\n", + "# If multiple sessions required, saving to file might help\n", + "writer = pd.ExcelWriter(localDir + 'vocabCheck1.xlsx')\n", + "vocabCheck1.to_excel(writer,'vocabCheck')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "'''\n", + "writer = pd.ExcelWriter(localDir + 'listToCheck2.xlsx')\n", + "listToCheck2.to_excel(writer,'listToCheck2')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "'''\n", + "\n", + "\n", + "'''\n", + "OPTIONS\n", + "\n", + "# Bring in from file\n", + "listToCheck3 = pd.read_excel('01 Pre-process/listToCheck3.xlsx')\n", + "listToCheck4 = pd.read_excel('01 Pre-process/listToCheck4.xlsx')\n", + "\n", + "listToCheck1 = unassignedAfterGS\n", + "listToCheck2 = unassignedAfterGS.iloc[5001:10000]\n", + "listToCheck1 = unassignedAfterGS.iloc[10001:11335]\n", + "'''\n", + "\n", + "\n", + "'''\n", + "Work with PostMan app to test/approve\n", + "\n", + "David from UMLS: One option would be to specify returnIdType=sourceUi for \n", + "your 
search request. For example, see the full support reply quoted in the cell above.\n",
+    "'''\n",
+    "\n",
+    "'''\n",
+    "FIXME - Unfinished.\n",
+    "\n",
+    "TGT-16294-ajZgfOTNGBxvzAXAvQslZtuL2U0HksFsED6tZ0ajoewNBNdSVz-cas\n",
+    "\n",
+    "\n",
+    "# THIS IS SOURCE VOCAB CODE\n",
+    "\n",
+    "https://uts-ws.nlm.nih.gov/rest/search/current?string=Infarto de miocardio&returnIdType=sourceUi&ticket=\n",
+    "'''\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Gather list of source vocabularies\n",
+    "# -----------------------------------\n",
+    "\n",
+    "uiUri = \"https://uts-ws.nlm.nih.gov/rest/search/current?returnIdType=sourceUi\"\n",
+    "\n",
+    "listOfSourceVocabularies = pd.DataFrame()\n",
+    "listOfSourceVocabularies['adjustedQueryCase'] = \"\"\n",
+    "listOfSourceVocabularies['sourceVocab'] = \"\"\n",
+    "\n",
+    "for index, row in listToCheck1.iterrows():\n",
+    "    currLogTerm = row['adjustedQueryCase']\n",
+    "    # === Get 'source vocab' =========\n",
+    "    stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST)\n",
+    "    # Example: GET 
https://uts-ws.nlm.nih.gov/rest/search/current?string=tylenol&sabs=MSH&ticket=ST-681163-bDfgQz5vKe2DJXvI4Snm-cas\n", + " termQuery = {'string':currLogTerm, 'ticket':stTicket.text} # removed 'searchType':'word' (it's the default), 'sabs':'MSH', \n", + " getSourceVocab = requests.get(uiUri, params=termQuery)\n", + " getSourceVocab.encoding = 'utf-8'\n", + " tItems = json.loads(getSourceVocab.text)\n", + " tJson = tItems[\"result\"]\n", + " if tJson[\"results\"][0][\"ui\"] != \"NONE\": # Sub-loop to resolve \"NONE\"\n", + " currUi = tJson[\"results\"][0][\"rootSource\"]\n", + " sourceVocab = tJson[\"results\"][0][\"rootSource\"]\n", + " # === Post to dataframe =========\n", + " listOfSourceVocabularies = listOfSourceVocabularies.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, \n", + " 'sourceVocab': sourceVocab}, index=[0]), ignore_index=True)\n", + " print('{} --> {}'.format(currLogTerm, sourceVocab)) # Write progress to console\n", + " time.sleep(.07)\n", + " else:\n", + " # Post \"NONE\" to database and restart loop\n", + " listOfSourceVocabularies = listOfSourceVocabularies.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, 'sourceVocab': \"NONE\"}, index=[0]), ignore_index=True)\n", + " print('{} --> NONE'.format(currLogTerm, )) # Write progress to console\n", + " time.sleep(.07)\n", + "print (\"* Done *\")\n", + "\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'listOfSourceVocabularies.xlsx')\n", + "listOfSourceVocabularies.to_excel(writer,'listOfSourceVocabularies')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "# Free up memory: Remove listToCheck, listToCheck1, listToCheck2, listToCheck3, \n", + "# listToCheck4, nonForeign, searchLog, unassignedAfterGS\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Load external reference file: SourceVocabsForeign.xlsx\n", + "\n", + "# F&R Foreign vocab names with the language name, \"Spanish,\" 
\"Swedish\"\n", + "\n", + "# Append to running list of updates\n", + "\n", + "\n", + "\n", + "# ------------------------------------------------------\n", + "# Match vocabCheck \n", + "# ------------------------------------------------------\n", + "'''\n", + "FIXME - RESULTING LIST NEEDS TO BE VETTED; START WITH HIGHEST-FREQUENCY USE.\n", + "\n", + "Update naming? this is the result from the API run for languages\n", + "\n", + "This custom list of vocabs does not include and English vocabs, therefore, \n", + "only foreign matches are returned, which is what we want.\n", + "\n", + "Re-start:\n", + "listOfSourceVocabularies = pd.read_excel(localDir + 'listOfSourceVocabularies.xlsx')\n", + "'''\n", + "\n", + "# Load list of Non-English vocabularies\n", + "# 7/5/2018, https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html (English vocabs not included.)\n", + "UMLS_NonEnglish_Vocabularies = pd.read_excel(localDir + 'UMLS_Non-English_Vocabularies.xlsx')\n", + "\n", + "# Inner join\n", + "foreignButEnglishChar = pd.merge(listOfSourceVocabularies, UMLS_NonEnglish_Vocabularies, how='inner', left_on='sourceVocab', right_on='Vocabulary')\n", + "\n", + "\n", + "# Get frequency count, reduce cols for easier manual checking\n", + "PerhapsForeign = pd.merge(foreignButEnglishChar, listOfUniqueUnassignedAfterUmls1, how='inner', on='adjustedQueryCase')\n", + "\n", + "PerhapsForeign = PerhapsForeign.sort_values(by='timesSearched', ascending=False)\n", + "PerhapsForeign = PerhapsForeign.reset_index()\n", + "PerhapsForeign.drop(['index'], axis=1, inplace=True)\n", + "col = ['adjustedQueryCase', 'timesSearched', 'Language']\n", + "PerhapsForeign = PerhapsForeign[col]\n", + "PerhapsForeign.rename(columns={'Language': 'LanguageGuess'}, inplace=True)\n", + "\n", + "# Send out for manual checking\n", + "writer = pd.ExcelWriter(localDir + 'PerhapsForeign.xlsx')\n", + "PerhapsForeign.to_excel(writer,'PerhapsForeign')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", 
+    "\n",
+    "'''\n",
+    "In Excel or Flask, delete rows with terms that we use in English; check that \n",
+    "the remaining rows contain terms that most English speakers would think are \n",
+    "foreign. \n",
+    "Supplement cols for the definite foreign terms, append to GoldStandard as \n",
+    "foreign terms.\n",
+    "'''\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Update GoldStandard with edits from PerhapsForeign result\n",
+    "\n",
+    "# Update current log file from PerhapsForeign result\n",
+    "\n",
+    "# Create new 'uniques' list for FuzzyWuzzy\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 5. Second UMLS API clean-up - Create logAfterUmlsApi2 as an \n",
+    "# update to logAfterUmlsApi by appending newUmlsWithSemanticGroupData\n",
+    "# ===========================================================================\n",
+    "'''\n",
+    "Use this AFTER you do a SECOND run against the UMLS Metathesaurus API.\n",
+    "\n",
+    "Re-start: \n",
+    "logAfterUmlsApi2 = pd.read_excel(localDir + 'logAfterUmlsApi2.xlsx')\n",
+    "'''\n",
+    "\n",
+    "logAfterUmlsApi2 = pd.read_excel(localDir + 'logAfterUmlsApi1.xlsx')\n",
+    "\n",
+    "# FIXME - Remove after this is fixed within the fixme above.\n",
+    "logAfterUmlsApi2 = logAfterUmlsApi2.sort_values(by='adjustedQueryCase', ascending=False)\n",
+    "logAfterUmlsApi2 = logAfterUmlsApi2.reset_index()\n",
+    "logAfterUmlsApi2.drop(['index'], axis=1, inplace=True)\n",
+    "\n",
+    "\n",
+    "# Join new UMLS API adds to the current search log master\n",
+    "logAfterUmlsApi2 = pd.merge(logAfterUmlsApi2, newUmlsWithSemanticGroupData, how='left', on='adjustedQueryCase')\n",
+    "\n",
+    "# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. 
Temporary fix:\n", + "logAfterUmlsApi2['preferredTerm2'] = logAfterUmlsApi2['preferredTerm_x'].where(logAfterUmlsApi2['preferredTerm_x'].notnull(), logAfterUmlsApi2['preferredTerm_y'])\n", + "logAfterUmlsApi2['SemanticTypeName2'] = logAfterUmlsApi2['SemanticTypeName_x'].where(logAfterUmlsApi2['SemanticTypeName_x'].notnull(), logAfterUmlsApi2['SemanticTypeName_y'])\n", + "logAfterUmlsApi2['SemanticGroupCode2'] = logAfterUmlsApi2['SemanticGroupCode_x'].where(logAfterUmlsApi2['SemanticGroupCode_x'].notnull(), logAfterUmlsApi2['SemanticGroupCode_y'])\n", + "logAfterUmlsApi2['SemanticGroup2'] = logAfterUmlsApi2['SemanticGroup_x'].where(logAfterUmlsApi2['SemanticGroup_x'].notnull(), logAfterUmlsApi2['SemanticGroup_y'])\n", + "logAfterUmlsApi2['BranchPosition2'] = logAfterUmlsApi2['BranchPosition_x'].where(logAfterUmlsApi2['BranchPosition_x'].notnull(), logAfterUmlsApi2['BranchPosition_y'])\n", + "logAfterUmlsApi2['CustomTreeNumber2'] = logAfterUmlsApi2['CustomTreeNumber_x'].where(logAfterUmlsApi2['CustomTreeNumber_x'].notnull(), logAfterUmlsApi2['CustomTreeNumber_y'])\n", + "logAfterUmlsApi2.drop(['preferredTerm_x', 'preferredTerm_y',\n", + " 'SemanticTypeName_x', 'SemanticTypeName_y',\n", + " 'SemanticGroup_x', 'SemanticGroup_y',\n", + " 'SemanticGroupCode_x', 'SemanticGroupCode_y',\n", + " 'BranchPosition_x', 'BranchPosition_y', \n", + " 'CustomTreeNumber_x', 'CustomTreeNumber_y'], axis=1, inplace=True)\n", + "logAfterUmlsApi2.rename(columns={'preferredTerm2': 'preferredTerm',\n", + " 'SemanticTypeName2': 'SemanticTypeName',\n", + " 'SemanticGroup2': 'SemanticGroup',\n", + " 'SemanticGroupCode2': 'SemanticGroupCode',\n", + " 'BranchPosition2': 'BranchPosition',\n", + " 'CustomTreeNumber2': 'CustomTreeNumber'\n", + " }, inplace=True)\n", + "\n", + "# Save to file so you can open in future sessions, if needed\n", + "writer = pd.ExcelWriter(localDir + 'logAfterUmlsApi2.xlsx')\n", + "logAfterUmlsApi2.to_excel(writer,'logAfterUmlsApi2')\n", + "# 
df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "\n", + "# -----------------------------------------------\n", + "# Create files to assign Semantic Types manually\n", + "# -----------------------------------------------\n", + "'''\n", + "If you want to add matches manually using two spreadsheet windows\n", + "To do in Python - cluster:\n", + " - Probable person names\n", + " - Probable NLM products, services, web pages\n", + " - Probable journal names\n", + "'''\n", + "\n", + "col = ['SemanticGroup', 'SemanticTypeName', 'Definition', 'Examples']\n", + "SemRef = SemanticNetworkReference[col]\n", + "\n", + "# Get class distributions if you want to bolster under-represented sem types\n", + "\n", + "currentSemTypeCount = GoldStandard['SemanticTypeName'].value_counts()\n", + "currentSemTypeCount = pd.DataFrame({'TypeCount':currentSemTypeCount})\n", + "currentSemTypeCount.sort_values(\"TypeCount\", ascending=True, inplace=True)\n", + "currentSemTypeCount = currentSemTypeCount.reset_index()\n", + "currentSemTypeCount = currentSemTypeCount.rename(columns={'index': 'SemanticTypeName'})\n", + "\n", + "\n", + "\n", + "# ------------------------------------\n", + "# Visualize results - logAfterUmlsApi2\n", + "# ------------------------------------\n", + " \n", + "# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/\n", + "totCount = len(logAfterUmlsApi2)\n", + "unassigned = logAfterUmlsApi2['SemanticGroup'].isnull().sum()\n", + "assigned = totCount - unassigned\n", + "labels = ['Assigned', 'Unassigned']\n", + "sizes = [assigned, unassigned]\n", + "colors = ['lightskyblue', 'lightcoral']\n", + "explode = (0.1, 0) # explode 1st slice\n", + "plt.pie(sizes, explode=explode, labels=labels, colors=colors,\n", + " autopct='%1.f%%', shadow=True, startangle=100)\n", + "plt.axis('equal')\n", + "plt.title(\"Status after 'UMLS API 2' processing\")\n", + "plt.show()\n", + "\n", + "# Bar of SemanticGroup categories, horizontal\n", + "# 
Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html\n",
+    "ax = logAfterUmlsApi2['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),\n",
+    "                                                 color=\"slateblue\", fontsize=10);\n",
+    "ax.set_alpha(0.8)\n",
+    "ax.set_title(\"Categories assigned after 'UMLS API 2' processing\", fontsize=14)\n",
+    "ax.set_xlabel(\"Number of searches\", fontsize=9);\n",
+    "# set individual bar labels using above list\n",
+    "for i in ax.patches:\n",
+    "    # get_width pulls left or right; get_y pushes up or down\n",
+    "    ax.text(i.get_width()+.1, i.get_y()+.31, \\\n",
+    "            str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')\n",
+    "# invert for largest on top \n",
+    "ax.invert_yaxis()\n",
+    "plt.gcf().subplots_adjust(left=0.3)\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 6. Create new 'uniques' dataframe/file for fuzzy matching\n",
+    "# ===========================================================================\n",
+    "'''\n",
+    "Re-start\n",
+    "\n",
+    "logAfterUmlsApi1 = pd.read_excel(localDir + 'logAfterUmlsApi1.xlsx')\n",
+    "\n",
+    "# Set a date range\n",
+    "AprMay = logAfterUmlsApi1[(logAfterUmlsApi1['Timestamp'] > '2018-04-01 01:00:00') & (logAfterUmlsApi1['Timestamp'] < '2018-06-01 00:00:00')]\n",
+    "\n",
+    "logAfterUmlsApi2 = AprMay\n",
+    "\n",
+    "# Restrict to NLM Home\n",
+    "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n",
+    "logAfterUmlsApi2 = logAfterUmlsApi2[logAfterUmlsApi2.Referrer.str.contains('|'.join(searchfor))]\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "listOfUniqueUnassignedAfterUmls2 = logAfterUmlsApi2[pd.isnull(logAfterUmlsApi2['preferredTerm'])]\n",
+    "listOfUniqueUnassignedAfterUmls2 = 
listOfUniqueUnassignedAfterUmls2.groupby('adjustedQueryCase').size()\n", + "listOfUniqueUnassignedAfterUmls2 = pd.DataFrame({'timesSearched':listOfUniqueUnassignedAfterUmls2})\n", + "listOfUniqueUnassignedAfterUmls2 = listOfUniqueUnassignedAfterUmls2.sort_values(by='timesSearched', ascending=False)\n", + "listOfUniqueUnassignedAfterUmls2 = listOfUniqueUnassignedAfterUmls2.reset_index()\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'unassignedToCheck2.xlsx')\n", + "listOfUniqueUnassignedAfterUmls2.to_excel(writer,'unassignedToCheck')\n", + "writer.save()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/02_Run_APIs.py b/02_Run_APIs.py new file mode 100644 index 0000000..211b5b3 --- /dev/null +++ b/02_Run_APIs.py @@ -0,0 +1,983 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Created on Thu Jun 28 15:33:33 2018 + +@author: dan.wendling@nih.gov + +Last modified: 2018-07-09 + + +---------------- +SCRIPT CONTENTS +---------------- + +1. Start-up +2. UmlsApi1 - Normalized string matching +3. Isolate entries updated by API, complete tagging, and match to the + current version of the search log - logAfterUmlsApi +4. Create logAfterUmlsApi as an update to logAfterGoldStandard by appending + newUmlsWithSemanticGroupData +5. Update GoldStandard +6. Create new 'uniques' dataframe/file for fuzzy matching + + +7. Google Translate API, https://cloud.google.com/translate/ +But it's not free; https://stackoverflow.com/questions/37667671/is-it-possible-to-access-to-google-translate-api-for-free + +8. UmlsApi2 - Tag non-English terms in Roman character sets + + + +7. 
UmlsApi3 - Word matching (relax prediction rules)

+8. RxNorm API
+
+9. UmlsApi4 - Re-run first config - Create logAfterUmlsApi4 as an
+   update to logAfterUmlsApi by appending newUmlsWithSemanticGroupData
+
+10. Create updated training file (GoldStandard) for ML script
+
+
+----------------------------------
+FIXME - DAN'S TO-DO ITEMS FOR DAN
+----------------------------------
+
+Change SemanticNetworkReference.UniqueID to SemanticTypeCode
+Add SemanticNetworkReference.SemanticTypeCode to what goes into the logs, for ML.
+
+
+----------
+RESOURCES
+----------
+
+Register at UMLS, get a UMLS-UTS API key, and add it below. This is the
+primary source for Semantic Type classifications.
+https://documentation.uts.nlm.nih.gov/rest/authentication.html
+UMLS quick start:
+UMLS description of what the Normalized String option is:
+https://uts.nlm.nih.gov/doc/devGuide/webservices/metaops/find/find2.html
+"""
+
+
+#%%
+# ============================================
+# 1. Start-up / What to put into place, where
+# ============================================
+
+import pandas as pd
+import matplotlib.pyplot as plt
+from matplotlib.pyplot import pie, axis, show
+import numpy as np
+import requests
+import json
+import lxml.html as lh
+from lxml.html import fromstring
+import time
+import os
+
+# Set working directory
+os.chdir('/Users/wendlingd/Projects/webDS/_util')
+
+
+localDir = '02_Run_APIs_files/'
+
+# If you're starting a new session and this file is not already open
+listOfUniqueUnassignedAfterGS = '01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx'
+listOfUniqueUnassignedAfterGS = pd.read_excel(listOfUniqueUnassignedAfterGS)
+
+# Bring in historical file of (somewhat edited) matches
+GoldStandard = '01_Pre-processing_files/GoldStandard.xlsx'
+GoldStandard = pd.read_excel(GoldStandard)
+
+
+
+# Get the API key from the UMLS_API_KEY environment variable or a local key
+# file; never hard-code the key in the script
+def get_umls_api_key(filename=None):
+    key = os.environ.get('UMLS_API_KEY', None)
+    if key is not None:
+        return key
+    
if filename is None: + path = os.environ.get('HOME', None) + if path is None: + path = os.environ.get('USERPROFILE', None) + if path is None: + path = '.' + filename = os.path.join(path, '.umls_api_key') + with open(filename, 'r') as f: + key = f.readline().strip() + return key + +myUTSAPIkey = get_umls_api_key() + + +''' +GoldStandard.xlsx - Already-assigned term list, from UMLS and other sources, + vetted. +''' + + +''' +SemanticNetworkReference - Customized version of the list at +https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html, +to be used to put search terms into huge bins. Should be integrated into +GoldStandard and be available at the end of the ML matching process. +''' +SemanticNetworkReference = '01_Pre-processing_files/SemanticNetworkReference.xlsx' + + +''' +- Run what remains against the UMLS API. + +Requires having your own license and API key; see https://www.nlm.nih.gov/research/umls/ +Not shown here: + - In huge files I sort by count and focus on terms searched by multiple + or many people. The 'long tail' can be huge. + - I have a database of terms aready assigned. I match these before + contacting UMLS; no need to check them again. Shortens processing time. +More options: + https://documentation.uts.nlm.nih.gov/rest/home.html + https://documentation.uts.nlm.nih.gov/rest/concept/ + +''' + + + +# unassignedAfterUmls1 = pd.read_excel(localdir + 'unassignedAfterUmls1.xlsx') + +''' +Register at RxNorm, get API key, and add it below. This is for drug misspellings. +''' + +# Generate a one-day Ticket-Granting-Ticket (TGT) +tgt = requests.post('https://utslogin.nlm.nih.gov/cas/v1/api-key', data = {'apikey':myUTSAPIkey}) +# For API key get a license from https://www.nlm.nih.gov/research/umls/ +# tgt.text +response = fromstring(tgt.text) +todaysTgt = response.xpath('//form/@action')[0] + +uiUri = "https://uts-ws.nlm.nih.gov/rest/search/current?" 
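The two-step CAS flow above (one day-long Ticket-Granting Ticket obtained from the API key, then one single-use Service Ticket per request) can be sketched without touching the network. The HTML snippet and helper names below are illustrative assumptions, not real UTS responses; the script itself uses lxml's `fromstring` instead of a regex:

```python
# Dependency-free sketch of the UTS ticket flow: the CAS api-key endpoint
# returns an HTML form whose action URL *is* the TGT; each search then needs
# a fresh single-use Service Ticket posted against that TGT URL.
import re

def extract_tgt_url(cas_html):
    """Pull the TGT URL out of the form's action attribute (like //form/@action)."""
    match = re.search(r'<form[^>]*\baction="([^"]+)"', cas_html)
    if match is None:
        raise ValueError('No form action found in CAS response')
    return match.group(1)

def build_search_params(term, service_ticket):
    """Query parameters for one normalizedString search, as used in the loop below."""
    return {'string': term, 'searchType': 'normalizedString', 'ticket': service_ticket}

# Illustrative snippet only -- a real CAS response is a full HTML page
sample = '<form action="https://utslogin.nlm.nih.gov/cas/v1/tickets/TGT-EXAMPLE-cas" method="POST"></form>'
print(extract_tgt_url(sample))
```

Keeping the ticket handling in small helpers like these also makes the rate-limited request loops further down easier to read and retry.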
+semUri = "https://uts-ws.nlm.nih.gov/rest/content/current/CUI/"
+
+
+
+#%%
+# =========================================
+# 2. UmlsApi1 - Normalized string matching
+# =========================================
+'''
+In this run the API calls use the Normalized String setting. Example:
+for the input string Yellow leaves, normalizedString would return two strings,
+leaf yellow and leave yellow. Each string would be matched exactly to the
+strings in the normalized string index to return a result.
+
+Re-start:
+# listOfUniqueUnassignedAfterGS = pd.read_excel('01_Pre-processing_files/listOfUniqueUnassignedAfterGS.xlsx')
+
+listToCheck6 = pd.read_excel(localDir + 'listToCheck6.xlsx')
+listToCheck7 = pd.read_excel(localDir + 'listToCheck7.xlsx')
+'''
+
+# ---------------------------------------
+# Batch rows so you can do separate runs
+# Max of 5,000 rows per run
+# ---------------------------------------
+
+# uniqueSearchTerms = search['adjustedQueryCase'].unique()
+
+# Reduce entry length, to focus on single concepts that UTS API can match
+listOfUniqueUnassignedAfterGS = listOfUniqueUnassignedAfterGS.loc[(listOfUniqueUnassignedAfterGS['adjustedQueryCase'].str.len() <= 20) == True]
+
+
+# .iloc end points are exclusive, so each batch starts where the last one ended
+# listToCheck1 = unassignedAfterGS.iloc[0:20]
+listToCheck1 = listOfUniqueUnassignedAfterGS.iloc[0:6000]
+listToCheck2 = listOfUniqueUnassignedAfterGS.iloc[6000:12000]
+listToCheck3 = listOfUniqueUnassignedAfterGS.iloc[12000:18000]
+listToCheck4 = listOfUniqueUnassignedAfterGS.iloc[18000:24000]
+listToCheck5 = listOfUniqueUnassignedAfterGS.iloc[24000:30000]
+listToCheck6 = listOfUniqueUnassignedAfterGS.iloc[30000:36000]
+listToCheck7 = listOfUniqueUnassignedAfterGS.iloc[36000:39523]
+
+
+
+'''
+listToCheck1 = unassignedToCheck.iloc[12497:20000]
+listToCheck2 = unassignedToCheck.iloc[20001:26000]
+listToCheck3 = unassignedToCheck.iloc[23225:28000]
+listToCheck4 = unassignedToCheck.iloc[28001:31256]
+
+mask = (unassignedToCheck['adjustedQueryCase'].str.len() <= 15)
+listToCheck3 = 
listToCheck3.loc[mask] +listToCheck4 = listToCheck4.loc[mask] +''' + + +# If multiple sessions required, saving to file might help +writer = pd.ExcelWriter(localDir + 'listToCheck7.xlsx') +listToCheck7.to_excel(writer,'listToCheck7') +# df2.to_excel(writer,'Sheet2') +writer.save() + +writer = pd.ExcelWriter(localDir + 'listToCheck2.xlsx') +listToCheck2.to_excel(writer,'listToCheck2') +# df2.to_excel(writer,'Sheet2') +writer.save() + +''' +OPTIONS + +# Bring in from file +listToCheck3 = pd.read_excel(localDir + 'listToCheck3.xlsx') +listToCheck4 = pd.read_excel(localDir + 'listToCheck4.xlsx') + +listToCheck1 = unassignedAfterGS +listToCheck2 = unassignedAfterGS.iloc[5001:10000] +listToCheck1 = unassignedAfterGS.iloc[10001:11335] +''' + + +#%% +# ---------------------------------------------------------- +# Run this block after changing listToCheck# top and bottom +# ---------------------------------------------------------- +''' +Until you put this into a function, you need to change listToCheck# +and apiGetNormalizedString# counts every run! +Stay below 30 API requests per second. With 4 API requests per item +(2 .get and 2 .post requests)... 
+time.sleep commented out: 6,000 / 35 min = 171 per minute = 2.9 items per second / 11.4 requests per second +Computing differently, 6,000 items @ 4 Req per item = 24,000 Req, divided by 35 min+ +686 Req/min = 11.4 Req/sec +time.sleep(.07): ~38 minutes to do 6,000; 158 per minute / 2.6 items per second +''' + +apiGetNormalizedString = pd.DataFrame() +apiGetNormalizedString['adjustedQueryCase'] = "" +apiGetNormalizedString['preferredTerm'] = "" +apiGetNormalizedString['SemanticTypeName'] = "" + +''' +For file 6, 7/5/18 1:05 p.m.: SSLError: HTTPSConnectionPool(host='utslogin.nlm.nih.gov', +port=443): Max retries exceeded with url: +/cas/v1/api-key/TGT-480224-qLwYAMKl5cTfa7Jwb7RWZ3kfexPUm479HfddD7yVUKt79lZ0Ta-cas +(Caused by SSLError(SSLError("bad handshake: SysCallError(60, 'ETIMEDOUT')",),)) + +Later, run 6 and 7 +''' + + +for index, row in listToCheck7.iterrows(): + currLogTerm = row['adjustedQueryCase'] + # === Get 'preferred term' and its concept identifier (CUI/UI) ========= + stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST) + # Example: GET https://uts-ws.nlm.nih.gov/rest/search/current?string=tylenol&sabs=MSH&ticket=ST-681163-bDfgQz5vKe2DJXvI4Snm-cas + tQuery = {'string':currLogTerm, 'searchType':'normalizedString', 'ticket':stTicket.text} # removed 'sabs':'MSH', + getPrefTerm = requests.get(uiUri, params=tQuery) + getPrefTerm.encoding = 'utf-8' + tItems = json.loads(getPrefTerm.text) + tJson = tItems["result"] + if tJson["results"][0]["ui"] != "NONE": # Sub-loop to resolve "NONE" + currUi = tJson["results"][0]["ui"] + currPrefTerm = tJson["results"][0]["name"] + # === Get 'semantic type' ========= + stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST) + # Example: GET https://uts-ws.nlm.nih.gov/rest/content/current/CUI/C0699142?ticket=ST-512564-vUxzyI00ErMRm6tjefNP-cas + semQuery = {'ticket':stTicket.text} + getPrefTerm = 
requests.get(semUri+currUi, params=semQuery) + getPrefTerm.encoding = 'utf-8' + semItems = json.loads(getPrefTerm.text) + semJson = semItems["result"] + currSemTypes = [] + for name in semJson["semanticTypes"]: + currSemTypes.append(name["name"]) # + " ; " + # === Post to dataframe ========= + apiGetNormalizedString = apiGetNormalizedString.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, + 'preferredTerm': currPrefTerm, + 'SemanticTypeName': currSemTypes[0]}, index=[0]), ignore_index=True) + print('{} --> {}'.format(currLogTerm, currSemTypes[0])) # Write progress to console + # time.sleep(.06) + else: + # Post "NONE" to database and restart loop + apiGetNormalizedString = apiGetNormalizedString.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, 'preferredTerm': "NONE"}, index=[0]), ignore_index=True) + print('{} --> NONE'.format(currLogTerm, )) # Write progress to console + # time.sleep(.06) +print ("* Done *") + + +writer = pd.ExcelWriter(localDir + 'apiGetNormalizedString7.xlsx') +apiGetNormalizedString.to_excel(writer,'apiGetNormalizedString') +# df2.to_excel(writer,'Sheet2') +writer.save() + + +# Free up memory: Remove listToCheck, listToCheck1, listToCheck2, listToCheck3, +# listToCheck4, nonForeign, searchLog, unassignedAfterGS + + +#%% +# ================================================================== +# 3. 
Isolate entries updated by API, complete tagging, and match to
+#    the current version of the search log - logAfterUmlsApi
+# ==================================================================
+'''
+To Do:
+
+    Isolate new assignments and:
+    - merge them into the master version of the log
+    - add to GoldStandard for next time
+
+    # Move unassigned entries into workflow for human identification
+
+To re-start
+
+unassignedAfterGS = pd.read_excel(localDir + 'unassignedAfterGS.xlsx')
+logAfterGoldStandard = pd.read_excel(localDir + 'logAfterGoldStandard.xlsx')
+
+listFromApi = pd.read_excel('02_UMLS_API_files/listFromApi1-April-May.xlsx')
+assignedByUmlsApi = pd.read_excel(localDir + 'assignedByUmlsApi.xlsx')
+
+# Fix temporary issue of nulls in SemanticTypeName, and wrong col name semTypeName
+
+listFromApi.drop(['SemanticTypeName'], axis=1, inplace=True)
+listFromApi.rename(columns={'semTypeName': 'SemanticTypeName'}, inplace=True)
+
+# listFromApi = listFromApi.dropna(subset=['SemanticTypeName'])
+'''
+
+
+# If you stored output from UMLS API in files, re-open and unite
+newAssignments1 = pd.read_excel(localDir + 'apiGetNormalizedString1.xlsx')
+newAssignments2 = pd.read_excel(localDir + 'apiGetNormalizedString2.xlsx')
+newAssignments3 = pd.read_excel(localDir + 'apiGetNormalizedString3.xlsx')
+newAssignments4 = pd.read_excel(localDir + 'apiGetNormalizedString4.xlsx')
+newAssignments5 = pd.read_excel(localDir + 'apiGetNormalizedString5.xlsx')
+newAssignments6 = pd.read_excel(localDir + 'apiGetNormalizedString6.xlsx')
+newAssignments7 = pd.read_excel(localDir + 'apiGetNormalizedString7.xlsx')
+
+
+# Put all seven batches together into one dataframe; df = df1.append([df2, df3])
+afterUmlsApi1 = newAssignments1.append([newAssignments2, newAssignments3, newAssignments4,
+                                        newAssignments5, newAssignments6, newAssignments7])
+
+
+'''
+afterUmlsApi1 = afterUmlsApi1.append(newAssignments3)
+afterUmlsApi1 = afterUmlsApi1.append(newAssignments4)
+'''
+
+
+# If you only 
used one df for listFromApi +# afterUMLSapi = listFromApi +# assignedByUmlsApi = listFromApi + + +# Reduce to a version that has only successful assignments + +# Remove various problem entries +assignedByUmlsApi1 = afterUmlsApi1.loc[(afterUmlsApi1['preferredTerm'] != "NONE")] +assignedByUmlsApi1 = assignedByUmlsApi1[~pd.isnull(assignedByUmlsApi1['preferredTerm'])] +assignedByUmlsApi1 = assignedByUmlsApi1.loc[(assignedByUmlsApi1['preferredTerm'] != "Null Value")] +assignedByUmlsApi1 = assignedByUmlsApi1[~pd.isnull(assignedByUmlsApi1['adjustedQueryCase'])] + + +# If you want to send to Excel +writer = pd.ExcelWriter(localDir + 'assignedByUmlsApi1.xlsx') +assignedByUmlsApi1.to_excel(writer,'assignedByUmlsApi1') +# df2.to_excel(writer,'Sheet2') +writer.save() + + +# Bring in subject category master file +# SemanticNetworkReference = pd.read_excel(localDir + 'SemanticNetworkReference.xlsx') +SemanticNetworkReference = pd.read_excel(SemanticNetworkReference) + +# Reduce to required cols +SemTypeData = SemanticNetworkReference[['SemanticTypeName', 'SemanticGroupCode', 'SemanticGroup', 'CustomTreeNumber', 'BranchPosition']] +# SemTypeData.rename(columns={'SemanticTypeName': 'semTypeName'}, inplace=True) # The join col + +# Add more semantic tagging to new UMLS API adds +newUmlsWithSemanticGroupData = pd.merge(assignedByUmlsApi1, SemTypeData, how='left', on='SemanticTypeName') + + +#%% +# ============================================================================ +# 4. Create logAfterUmlsApi as an update to logAfterGoldStandard by appending +# newUmlsWithSemanticGroupData +# ============================================================================ + +''' +Depending on what you're processing, use this or the next section of the below. + +Depends on how you choose to process - Like, down to one occurrence to API +in first batch, or not. 
+''' + + +logAfterGoldStandard = '01_Pre-processing_files/logAfterGoldStandard.xlsx' +logAfterGoldStandard = pd.read_excel(logAfterGoldStandard) + + +''' +# FIXME - Remove after this is fixed within the fixme above. +logAfterGoldStandard = logAfterGoldStandard.sort_values(by='adjustedQueryCase', ascending=True) +logAfterGoldStandard = logAfterGoldStandard.reset_index() +logAfterGoldStandard.drop(['index'], axis=1, inplace=True) +''' + + +# Eyeball. If you need to remove rows... +# logAfterGoldStandard = logAfterGoldStandard.iloc[760:] # remove before index... + +# Join new UMLS API adds to the current search log master +logAfterUmlsApi1 = pd.merge(logAfterGoldStandard, newUmlsWithSemanticGroupData, how='left', on='adjustedQueryCase') + +logAfterUmlsApi1.columns + +''' +['SessionID', 'StaffYN', 'Referrer', 'Query', 'Timestamp', + 'adjustedQueryCase', 'SemanticTypeName_x', 'SemanticGroup_x', + 'SemanticGroupCode_x', 'BranchPosition_x', 'CustomTreeNumber_x', + 'ResourceType', 'Address', 'EntrySource', 'contentSteward', + 'preferredTerm_x', 'SemanticTypeName_y', 'preferredTerm_y', + 'SemanticGroupCode_y', 'SemanticGroup_y', 'CustomTreeNumber_y', + 'BranchPosition_y'] + +''' + + +# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. 
Temporary fix: +logAfterUmlsApi1['preferredTerm2'] = logAfterUmlsApi1['preferredTerm_x'].where(logAfterUmlsApi1['preferredTerm_x'].notnull(), logAfterUmlsApi1['preferredTerm_y']) +logAfterUmlsApi1['SemanticTypeName2'] = logAfterUmlsApi1['SemanticTypeName_x'].where(logAfterUmlsApi1['SemanticTypeName_x'].notnull(), logAfterUmlsApi1['SemanticTypeName_y']) +logAfterUmlsApi1['SemanticGroup2'] = logAfterUmlsApi1['SemanticGroup_x'].where(logAfterUmlsApi1['SemanticGroup_x'].notnull(), logAfterUmlsApi1['SemanticGroup_y']) +logAfterUmlsApi1['SemanticGroupCode2'] = logAfterUmlsApi1['SemanticGroupCode_x'].where(logAfterUmlsApi1['SemanticGroupCode_x'].notnull(), logAfterUmlsApi1['SemanticGroupCode_y']) +logAfterUmlsApi1['BranchPosition2'] = logAfterUmlsApi1['BranchPosition_x'].where(logAfterUmlsApi1['BranchPosition_x'].notnull(), logAfterUmlsApi1['BranchPosition_y']) +logAfterUmlsApi1['CustomTreeNumber2'] = logAfterUmlsApi1['CustomTreeNumber_x'].where(logAfterUmlsApi1['CustomTreeNumber_x'].notnull(), logAfterUmlsApi1['CustomTreeNumber_y']) +logAfterUmlsApi1.drop(['preferredTerm_x', 'preferredTerm_y', + 'SemanticTypeName_x', 'SemanticTypeName_y', + 'SemanticGroup_x', 'SemanticGroup_y', + 'SemanticGroupCode_x', 'SemanticGroupCode_y', + 'BranchPosition_x', 'BranchPosition_y', + 'CustomTreeNumber_x', 'CustomTreeNumber_y'], axis=1, inplace=True) +logAfterUmlsApi1.rename(columns={'preferredTerm2': 'preferredTerm', + 'SemanticTypeName2': 'SemanticTypeName', + 'SemanticGroup2': 'SemanticGroup', + 'SemanticGroupCode2': 'SemanticGroupCode', + 'BranchPosition2': 'BranchPosition', + 'CustomTreeNumber2': 'CustomTreeNumber' + }, inplace=True) + +# Save to file so you can open in future sessions, if needed +writer = pd.ExcelWriter(localDir + 'logAfterUmlsApi1.xlsx') +logAfterUmlsApi1.to_excel(writer,'logAfterUmlsApi1') +# df2.to_excel(writer,'Sheet2') +writer.save() + +''' +To Do: + - Create list of unmatched terms with freq + - Cluster similar spellings together? 
+
+- Look at "Not currently matchable" terms with "high" frequency counts. Eyeball to see if these were incorrectly matched in the past; assign historical term or update all to new term, save in gold standard file.
+- Process entries from the PubMed product page.
+- If you haven't done so, update RegEx list to improve future matching.
+- Every several months, through Flask interface, interactively update the gold standard, manually.
+
+# Reduce logAfterUmlsApi to unique, unmatched entries, prep for ML
+
+To re-start:
+logAfterUmlsApi = pd.read_excel(localDir + 'logAfterUmlsApi.xlsx')
+'''
+
+
+# ------------------------------------
+# Visualize results - logAfterUmlsApi
+# ------------------------------------
+
+# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/
+totCount = len(logAfterUmlsApi1)
+unassigned = logAfterUmlsApi1['SemanticGroup'].isnull().sum()
+assigned = totCount - unassigned
+labels = ['Assigned', 'Unassigned']
+sizes = [assigned, unassigned]
+colors = ['lightskyblue', 'lightcoral']
+explode = (0.1, 0)  # explode 1st slice
+plt.pie(sizes, explode=explode, labels=labels, colors=colors,
+        autopct='%1.f%%', shadow=True, startangle=100)
+plt.axis('equal')
+plt.title("Status after 'UMLS API' processing")
+plt.show()
+
+# Bar of SemanticGroup categories, horizontal
+# Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html
+ax = logAfterUmlsApi1['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),
+                                                           color="slateblue", fontsize=10);
+ax.set_alpha(0.8)
+ax.set_title("Categories assigned after 'UMLS API' processing", fontsize=14)
+ax.set_xlabel("Number of searches", fontsize=9);
+# set individual bar labels using above list
+for i in ax.patches:
+    # get_width pulls left or right; get_y pushes up or down
+    ax.text(i.get_width()+.1, i.get_y()+.31, \
+            str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')
+# invert for largest on top
+ax.invert_yaxis()
+plt.gcf().subplots_adjust(left=0.3) 
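The pie chart above boils down to one number: the share of queries that received a SemanticGroup. That figure is worth logging on every run; here is a stdlib-only sketch of the calculation (the sample list is made up, and in the script the input is `logAfterUmlsApi1['SemanticGroup']`, where missing values are NaN rather than `None`):

```python
# Percent of rows with a non-null SemanticGroup, mirroring the
# totCount / isnull().sum() arithmetic used for the pie chart.
def assignment_rate(semantic_groups):
    """Return the assigned share as a percentage, rounded to one decimal."""
    total = len(semantic_groups)
    if total == 0:
        return 0.0
    assigned = sum(1 for g in semantic_groups if g is not None)
    return round(100.0 * assigned / total, 1)

sample = ['Disorders', 'Chemicals & Drugs', None, 'Disorders', None]
print(assignment_rate(sample))  # 60.0
```

Tracking this one number across runs (after the gold standard, after each API pass) shows whether each stage is still earning its processing time.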
+ +# Remove listOfUniqueUnassignedAfterGS, listToCheck1, etc., logAfterGoldStandard, logAfterUmlsApi1, +# newAssignments1 etc. + + + +#%% +# ======================= +# 5. Update GoldStandard +# ======================= + +# Open GoldStandard if needed +GoldStandard = '01_Pre-processing_files/GoldStandard.xlsx' +GoldStandard = pd.read_excel(GoldStandard) + +# Append fully tagged UMLS API adds to GoldStandard +GoldStandard = GoldStandard.append(newUmlsWithSemanticGroupData, sort=False) + +# Reset index +GoldStandard = GoldStandard.reset_index() +GoldStandard.drop(['index'], axis=1, inplace=True) +# temp GoldStandard.drop(['adjustedQueryCase'], axis=1, inplace=True) + +''' +Eyeball top and bottom of cols, remove rows by Index, if needed + +GoldStandard.drop(58027, inplace=True) +''' + + +# Write out the updated GoldStandard +writer = pd.ExcelWriter('01_Pre-processing_files/GoldStandard.xlsx') +GoldStandard.to_excel(writer,'GoldStandard') +writer.save() + + + +#%% +# ============================================================================ +# 6. Start new 'uniques' dataframe that gets new column for each of the below +# listOfUniqueUnassignedAfterUmls1 +# ============================================================================ + +''' +To Do: + - Create list of unmatched terms with freq + - Cluster similar spellings together? + +- Look at "Not currently matchable" terms with "high" frequency counts. Eyeball to see if these were incorrectly matched in the past; assign historical term or update all to new term, save in gold standard file. +- Process entries from the PubMed product page. +- If you haven't done so, update RegEx list to improve future matching. +- Every several months, through Flask interface, interactively update the gold standard, manually. 
+
+# Reduce logAfterUmlsApi to unique, unmatched entries, prep for ML
+
+To re-start:
+logAfterUmlsApi = pd.read_excel(localDir + 'logAfterUmlsApi.xlsx')
+'''
+
+listOfUniqueUnassignedAfterUmls1 = logAfterUmlsApi1[pd.isnull(logAfterUmlsApi1['SemanticGroup'])]
+listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.groupby('adjustedQueryCase').size()
+listOfUniqueUnassignedAfterUmls1 = pd.DataFrame({'timesSearched':listOfUniqueUnassignedAfterUmls1})
+listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.sort_values(by='timesSearched', ascending=False)
+listOfUniqueUnassignedAfterUmls1 = listOfUniqueUnassignedAfterUmls1.reset_index()
+
+writer = pd.ExcelWriter(localDir + 'listOfUniqueUnassignedAfterUmls1.xlsx')
+listOfUniqueUnassignedAfterUmls1.to_excel(writer,'unassignedToCheck')
+writer.save()
+
+# FY 18 Q3: 57,287
+
+
+#%%
+# =============================================================
+# 7. Google Translate API, https://cloud.google.com/translate/
+# =============================================================
+'''
+But it's not free; https://stackoverflow.com/questions/37667671/is-it-possible-to-access-to-google-translate-api-for-free
+'''
+
+
+#%%
+# ==========================================================================
+# 8. UmlsApi2 - Tag non-English terms in Roman character sets
+# ==========================================================================
+'''
+Some foreign terms can be matched. This run does not return a preferred term;
+it just returns what vocabulary the term is found in.
+
+Queries with words not in English are ignored by the first API run using
+"normalized string" matching. Here, try flagging what you can and take them
+out of the percent-complete calculation.
+
+The API apparently only supports U.S. English. RegEx could be used to convert
+UTF-8 Roman characters that are not English... Non-Roman languages (Chinese,
+Cyrillic, Arabic, Japanese, etc.) 
are not supported by the API; these should +be kept out of the API runs entirely. + +6/22/18, from David of UMLS support, TRACKING:000308010 + +> Can the UMLS REST API tell me the term's language? + +One option would be to specify returnIdType=sourceUi for your search +request. For example: + +https://uts-ws.nlm.nih.gov/rest/search/current?string=Infarto de miocardio&returnIdType=sourceUi&ticket= + +This will give you a set of codes back where there is a match, but will +also return a vocabulary (rootSource). If you have that, you can get +the language (in this case, Spanish). The first result may be all you +need. If you have the rootSource, you can match it to the "abbreviation" +and look up the language here: https://uts-ws.nlm.nih.gov/rest/metadata/current/sources. + +It won't be perfect. I'm seeing some problems with accented characters. +For example, coração returns no results, so that's not great, but may +not matter. Some strings will appear in multiple languages, too. + +Let me know how that works for you. - David +''' + + +# ------------------------------------------------------ +# Batch up your API runs. Re-starting, correcting, etc. 
+# ------------------------------------------------------
+
+# uniqueSearchTerms = search['adjustedQueryCase'].unique()
+
+# vocabCheck1 = unassignedAfterGS.iloc[0:20]
+vocabCheck1 = listOfUniqueUnassignedAfterUmls1.iloc[0:5000]
+# vocabCheck2 = listOfUniqueUnassignedAfterUmls1.iloc[5001:10678]
+
+
+# If multiple sessions required, saving to file might help
+writer = pd.ExcelWriter(localDir + 'vocabCheck1.xlsx')
+vocabCheck1.to_excel(writer,'vocabCheck')
+# df2.to_excel(writer,'Sheet2')
+writer.save()
+
+
+'''
+writer = pd.ExcelWriter(localDir + 'listToCheck2.xlsx')
+listToCheck2.to_excel(writer,'listToCheck2')
+# df2.to_excel(writer,'Sheet2')
+writer.save()
+'''
+
+
+'''
+OPTIONS
+
+# Bring in from file
+listToCheck3 = pd.read_excel('01 Pre-process/listToCheck3.xlsx')
+listToCheck4 = pd.read_excel('01 Pre-process/listToCheck4.xlsx')
+
+listToCheck1 = unassignedAfterGS
+listToCheck2 = unassignedAfterGS.iloc[5001:10000]
+listToCheck1 = unassignedAfterGS.iloc[10001:11335]
+'''
+
+
+'''
+Work with PostMan app to test/approve. (David's returnIdType=sourceUi advice
+is quoted in full above.)
+'''
+
+'''
+FIXME - Unfinished. 
+
+# THIS IS SOURCE VOCAB CODE
+
+https://uts-ws.nlm.nih.gov/rest/search/current?string=Infarto de miocardio&returnIdType=sourceUi&ticket=
+'''
+
+
+#%%
+# -----------------------------------
+# Gather list of source vocabularies
+# -----------------------------------
+
+uiUri = "https://uts-ws.nlm.nih.gov/rest/search/current?returnIdType=sourceUi"
+
+listOfSourceVocabularies = pd.DataFrame()
+listOfSourceVocabularies['adjustedQueryCase'] = ""
+listOfSourceVocabularies['sourceVocab'] = ""
+
+# Iterate over the batch created above
+for index, row in vocabCheck1.iterrows():
+    currLogTerm = row['adjustedQueryCase']
+    # === Get 'source vocab' =========
+    stTicket = requests.post(todaysTgt, data = {'service':'http://umlsks.nlm.nih.gov'}) # Get single-use Service Ticket (ST)
+    # Example: GET https://uts-ws.nlm.nih.gov/rest/search/current?string=tylenol&sabs=MSH&ticket=ST-681163-bDfgQz5vKe2DJXvI4Snm-cas
+    termQuery = {'string':currLogTerm, 'ticket':stTicket.text} # removed 'searchType':'word' (it's the default), 'sabs':'MSH',
+    getSourceVocab = requests.get(uiUri, params=termQuery)
+    getSourceVocab.encoding = 'utf-8'
+    tItems = json.loads(getSourceVocab.text)
+    tJson = tItems["result"]
+    if tJson["results"][0]["ui"] != "NONE": # Sub-loop to resolve "NONE"
+        sourceVocab = tJson["results"][0]["rootSource"]
+        # === Post to dataframe =========
+        listOfSourceVocabularies = listOfSourceVocabularies.append(pd.DataFrame({'adjustedQueryCase': currLogTerm,
+                                                   'sourceVocab': sourceVocab}, index=[0]), ignore_index=True)
+        print('{} --> {}'.format(currLogTerm, sourceVocab)) # Write progress to console
+        time.sleep(.07)
+    else:
+        # Post "NONE" to database and restart loop
+        listOfSourceVocabularies = listOfSourceVocabularies.append(pd.DataFrame({'adjustedQueryCase': currLogTerm, 'sourceVocab': "NONE"}, index=[0]), ignore_index=True)
+        print('{} --> NONE'.format(currLogTerm)) # Write progress to console
+        
time.sleep(.07)
+print ("* Done *")
+
+
+writer = pd.ExcelWriter(localDir + 'listOfSourceVocabularies.xlsx')
+listOfSourceVocabularies.to_excel(writer,'listOfSourceVocabularies')
+# df2.to_excel(writer,'Sheet2')
+writer.save()
+
+# Free up memory: Remove listToCheck, listToCheck1, listToCheck2, listToCheck3,
+# listToCheck4, nonForeign, searchLog, unassignedAfterGS
+
+
+#%%
+
+# Load external reference file: SourceVocabsForeign.xlsx
+
+# F&R Foreign vocab names with the language name, "Spanish," "Swedish"
+
+# Append to running list of updates
+
+
+
+# ------------------------------------------------------
+# Match vocabCheck
+# ------------------------------------------------------
+'''
+FIXME - RESULTING LIST NEEDS TO BE VETTED; START WITH HIGHEST-FREQUENCY USE.
+
+Update naming? This is the result from the API run for languages.
+
+This custom list of vocabs does not include any English vocabs; therefore,
+only foreign matches are returned, which is what we want.
+
+Re-start:
+listOfSourceVocabularies = pd.read_excel(localDir + 'listOfSourceVocabularies.xlsx')
+'''
+
+# Load list of Non-English vocabularies
+# 7/5/2018, https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html (English vocabs not included.)
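The "F&R" step above (replace foreign vocabulary names with the language name) is only sketched in comments so far. A minimal illustration of the idea with a plain dict; the abbreviation-to-language pairs are assumptions patterned on UMLS source naming (e.g., MSHSPA for Spanish MeSH) and should be verified against the source-vocabulary list:

```python
# Illustrative only: map UMLS source-vocabulary abbreviations to language
# names; unmapped vocabularies come back as None so they can be reviewed.
# The abbreviations below are ASSUMED examples, not a vetted reference list.
vocabToLanguage = {'MSHSPA': 'Spanish',
                   'MSHSWE': 'Swedish',
                   'MSHFRE': 'French'}

def languageGuess(sourceVocab):
    """Return a language name for a non-English source vocabulary, else None."""
    return vocabToLanguage.get(sourceVocab)
```

In pandas the same mapping is `listOfSourceVocabularies['sourceVocab'].map(vocabToLanguage)`, which leaves unmapped vocabularies as NaN for manual review.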
+
+UMLS_NonEnglish_Vocabularies = pd.read_excel(localDir + 'UMLS_Non-English_Vocabularies.xlsx')
+
+# Inner join
+foreignButEnglishChar = pd.merge(listOfSourceVocabularies, UMLS_NonEnglish_Vocabularies, how='inner', left_on='sourceVocab', right_on='Vocabulary')
+
+
+# Get frequency count, reduce cols for easier manual checking
+PerhapsForeign = pd.merge(foreignButEnglishChar, listOfUniqueUnassignedAfterUmls1, how='inner', on='adjustedQueryCase')
+
+PerhapsForeign = PerhapsForeign.sort_values(by='timesSearched', ascending=False)
+PerhapsForeign = PerhapsForeign.reset_index()
+PerhapsForeign.drop(['index'], axis=1, inplace=True)
+col = ['adjustedQueryCase', 'timesSearched', 'Language']
+PerhapsForeign = PerhapsForeign[col]
+PerhapsForeign.rename(columns={'Language': 'LanguageGuess'}, inplace=True)
+
+# Send out for manual checking
+writer = pd.ExcelWriter(localDir + 'PerhapsForeign.xlsx')
+PerhapsForeign.to_excel(writer,'PerhapsForeign')
+# df2.to_excel(writer,'Sheet2')
+writer.save()
+
+'''
+In Excel or Flask, delete rows with terms that we use in English; check that
+the remaining rows contain terms that most English speakers would think are
+foreign.
+Supplement cols for the definite foreign terms, append to GoldStandard as
+foreign terms.
+'''
+
+#%%
+
+# Update GoldStandard with edits from PerhapsForeign result
+
+# Update current log file from PerhapsForeign result
+
+# Create new 'uniques' list for FuzzyWuzzy
+
+
+
+
+
+
+#%%
+# ===========================================================================
+# 5. Second UMLS API clean-up - Create logAfterUmlsApi2 as an
+# update to logAfterUmlsApi by appending newUmlsWithSemanticGroupData
+# ===========================================================================
+'''
+Use this AFTER you do a SECOND run against the UMLS Metathesaurus API. 
+ +Re-start: +logAfterUmlsApi2 = pd.read_excel(localDir + 'logAfterUmlsApi2.xlsx') +''' + +logAfterUmlsApi2 = pd.read_excel(localDir + 'logAfterUmlsApi1.xlsx') + +# FIXME - Remove after this is fixed within the fixme above. +logAfterUmlsApi2 = logAfterUmlsApi2.sort_values(by='adjustedQueryCase', ascending=False) +logAfterUmlsApi2 = logAfterUmlsApi2.reset_index() +logAfterUmlsApi2.drop(['index'], axis=1, inplace=True) + + +# Join new UMLS API adds to the current search log master +logAfterUmlsApi2 = pd.merge(logAfterUmlsApi, newUmlsWithSemanticGroupData, how='left', on='adjustedQueryCase') + +# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. Temporary fix: +logAfterUmlsApi2['preferredTerm2'] = logAfterUmlsApi2['preferredTerm_x'].where(logAfterUmlsApi2['preferredTerm_x'].notnull(), logAfterUmlsApi2['preferredTerm_y']) +logAfterUmlsApi2['SemanticTypeName2'] = logAfterUmlsApi2['SemanticTypeName_x'].where(logAfterUmlsApi2['SemanticTypeName_x'].notnull(), logAfterUmlsApi2['SemanticTypeName_y']) +logAfterUmlsApi2['SemanticGroupCode2'] = logAfterUmlsApi2['SemanticGroupCode_x'].where(logAfterUmlsApi2['SemanticGroupCode_x'].notnull(), logAfterUmlsApi2['SemanticGroupCode_y']) +logAfterUmlsApi2['SemanticGroup2'] = logAfterUmlsApi2['SemanticGroup_x'].where(logAfterUmlsApi2['SemanticGroup_x'].notnull(), logAfterUmlsApi2['SemanticGroup_y']) +logAfterUmlsApi2['BranchPosition2'] = logAfterUmlsApi2['BranchPosition_x'].where(logAfterUmlsApi2['BranchPosition_x'].notnull(), logAfterUmlsApi2['BranchPosition_y']) +logAfterUmlsApi2['CustomTreeNumber2'] = logAfterUmlsApi2['CustomTreeNumber_x'].where(logAfterUmlsApi2['CustomTreeNumber_x'].notnull(), logAfterUmlsApi2['CustomTreeNumber_y']) +logAfterUmlsApi2.drop(['preferredTerm_x', 'preferredTerm_y', + 'SemanticTypeName_x', 'SemanticTypeName_y', + 'SemanticGroup_x', 'SemanticGroup_y', + 'SemanticGroupCode_x', 'SemanticGroupCode_y', + 'BranchPosition_x', 'BranchPosition_y', + 'CustomTreeNumber_x', 
'CustomTreeNumber_y'], axis=1, inplace=True) +logAfterUmlsApi2.rename(columns={'preferredTerm2': 'preferredTerm', + 'SemanticTypeName2': 'SemanticTypeName', + 'SemanticGroup2': 'SemanticGroup', + 'SemanticGroupCode2': 'SemanticGroupCode', + 'BranchPosition2': 'BranchPosition', + 'CustomTreeNumber2': 'CustomTreeNumber' + }, inplace=True) + +# Save to file so you can open in future sessions, if needed +writer = pd.ExcelWriter(localDir + 'logAfterUmlsApi2.xlsx') +logAfterUmlsApi2.to_excel(writer,'logAfterUmlsApi2') +# df2.to_excel(writer,'Sheet2') +writer.save() + + + +# ----------------------------------------------- +# Create files to assign Semantic Types manually +# ----------------------------------------------- +''' +If you want to add matches manually using two spreadsheet windows +To do in Python - cluster: + - Probable person names + - Probable NLM products, services, web pages + - Probable journal names +''' + +col = ['SemanticGroup', 'SemanticTypeName', 'Definition', 'Examples'] +SemRef = SemanticNetworkReference[col] + +# Get class distributions if you want to bolster under-represented sem types + +currentSemTypeCount = GoldStandard['SemanticTypeName'].value_counts() +currentSemTypeCount = pd.DataFrame({'TypeCount':currentSemTypeCount}) +currentSemTypeCount.sort_values("TypeCount", ascending=True, inplace=True) +currentSemTypeCount = currentSemTypeCount.reset_index() +currentSemTypeCount = currentSemTypeCount.rename(columns={'index': 'SemanticTypeName'}) + + + +# ------------------------------------ +# Visualize results - logAfterUmlsApi2 +# ------------------------------------ + +# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/ +totCount = len(logAfterUmlsApi2) +unassigned = logAfterUmlsApi2['SemanticGroup'].isnull().sum() +assigned = totCount - unassigned +labels = ['Assigned', 'Unassigned'] +sizes = [assigned, unassigned] +colors = ['lightskyblue', 'lightcoral'] +explode = (0.1, 0) # explode 1st slice +plt.pie(sizes, 
explode=explode, labels=labels, colors=colors,
+        autopct='%1.f%%', shadow=True, startangle=100)
+plt.axis('equal')
+plt.title("Status after 'UMLS API 2' processing")
+plt.show()
+
+# Bar of SemanticGroup categories, horizontal
+# Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html
+ax = logAfterUmlsApi2['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),
+                                                 color="slateblue", fontsize=10);
+ax.set_alpha(0.8)
+ax.set_title("Categories assigned after 'UMLS API 2' processing", fontsize=14)
+ax.set_xlabel("Number of searches", fontsize=9);
+# set individual bar labels using above list
+for i in ax.patches:
+    # get_width pulls left or right; get_y pushes up or down
+    ax.text(i.get_width()+.1, i.get_y()+.31, \
+            str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')
+# invert for largest on top
+ax.invert_yaxis()
+plt.gcf().subplots_adjust(left=0.3)
+
+
+
+#%%
+# ===========================================================================
+# 6. Create new 'uniques' dataframe/file for fuzzy matching
+# ===========================================================================
+'''
+Re-start
+
+logAfterUmlsApi1 = pd.read_excel(localDir + 'logAfterUmlsApi1.xlsx')
+
+# Set a date range
+AprMay = logAfterUmlsApi1[(logAfterUmlsApi1['Timestamp'] > '2018-04-01 01:00:00') & (logAfterUmlsApi1['Timestamp'] < '2018-06-01 00:00:00')]
+
+logAfterUmlsApi2 = AprMay
+
+# Restrict to NLM Home
+searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']
+logAfterUmlsApi2 = logAfterUmlsApi2[logAfterUmlsApi2.Referrer.str.contains('|'.join(searchfor))]
+
+'''
+
+
+
+listOfUniqueUnassignedAfterUmls2 = logAfterUmlsApi2[pd.isnull(logAfterUmlsApi2['preferredTerm'])]
+listOfUniqueUnassignedAfterUmls2 = listOfUniqueUnassignedAfterUmls2.groupby('adjustedQueryCase').size()
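The filter/groupby/size sequence here (finished just below with the sort and reset_index) is, at heart, a descending frequency count of the still-unassigned queries. A toy sketch of the same logic with the standard library; the query strings are illustrative only:

```python
from collections import Counter

# Stand-in for the 'adjustedQueryCase' values whose preferredTerm is still null
unassignedQueries = ['aspirin', 'cbd oil', 'aspirin', 'zika']

# most_common() yields (term, timesSearched) pairs, highest count first,
# the same shape the listOfUniqueUnassignedAfterUmls2 dataframe ends up with
uniques = Counter(unassignedQueries).most_common()
```

In pandas, `Series.value_counts()` gives the equivalent count-and-sort in one call.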
+listOfUniqueUnassignedAfterUmls2 = pd.DataFrame({'timesSearched':listOfUniqueUnassignedAfterUmls2}) +listOfUniqueUnassignedAfterUmls2 = listOfUniqueUnassignedAfterUmls2.sort_values(by='timesSearched', ascending=False) +listOfUniqueUnassignedAfterUmls2 = listOfUniqueUnassignedAfterUmls2.reset_index() + +writer = pd.ExcelWriter(localDir + 'unassignedToCheck2.xlsx') +listOfUniqueUnassignedAfterUmls2.to_excel(writer,'unassignedToCheck') +writer.save() diff --git a/03_Fuzzy_match.ipynb b/03_Fuzzy_match.ipynb new file mode 100644 index 0000000..02c26a6 --- /dev/null +++ b/03_Fuzzy_match.ipynb @@ -0,0 +1,530 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 3. Fuzzy match\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** For training ML algorithms: Post fuzzy match candidates to browser so user can select manually
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "\n", + "## Script contents\n", + "\n", + "1. Start-up / What to put into place, where\n", + "2. FuzzyWuzzyListToAdd - FuzzyWuzzy matching\n", + "3. Add result to MySQL, process at http://localhost:5000/fuzzy/\n", + " (Use browser to update MySQL table)\n", + "4. Bring data from manual_assignments back into Pandas\n", + "5. Update log and GoldStandard with new matches from MySQL\n", + "6. Create new 'uniques' dataframe/file for ML\n", + "7. Next steps\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. Start-up / What to put into place, where\n", + "# ============================================\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import requests\n", + "import json\n", + "import lxml.html as lh\n", + "from lxml.html import fromstring\n", + "import time\n", + "import os\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/Projects/webDS/_util')\n", + "\n", + "localDir = '03_Fuzzy_match_files/'\n", + "\n", + "\n", + "\n", + "# Bring in historical file of (somewhat edited) matches\n", + "GoldStandard = '01_Text_wrangling_files/GoldStandard_master.xlsx'\n", + "GoldStandard = pd.read_excel(GoldStandard)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. FuzzyWuzzyListToAdd - FuzzyWuzzy matching\n", + "5,000 in ~25 minutes; ~10,000 in ~50 minutes (at score_cutoff=85)\n", + "\n", + "Fuzzy match can be applied to an entire column of dataset_1 to return the \n", + "best score against the column of dataset_2. 
Here we set the scorer to \n",
+    "‘token_set_ratio’ with score_cutoff of (originally) 90.\n",
+    "\n",
+    "FuzzyWuzzy was written for single inputs to a web form; I, however, \n",
+    "am using it to compare one dataframe column to another dataframe's column,\n",
+    "which is poorly documented outside https://www.neudesic.com/blog/fuzzywuzzy-using-python/.\n",
+    "It takes some work to match the tokenized function output back \n",
+    "to the original untokenized term, which is necessary for this work.\n",
+    "\n",
+    "For more options see temp_FuzzyWuzzyHowTo.py\n",
+    "\n",
+    "Browser page looks like: \n",
+    "\n",
+    "\n",
+    "\n",
+    "\n",
+    "# Quick test, if you want - punctuation difference\n",
+    "fuzz.ratio('Testing FuzzyWuzzy', 'Testing FuzzyWuzzy!!')\n",
+    "\n",
+    "FuzzyWuzzyResults - What the results of this function mean:\n",
+    "('hippocratic oath', 100, 2987)\n",
+    "('Best match string from dataset_2' (GoldStandard), 'Score of best match', 'Index of best match string in GoldStandard')\n",
+    "\n",
+    "Re-start:\n",
+    "listOfUniqueUnassignedAfterUmls11 = pd.read_excel('02_Run_APIs_files/listOfUniqueUnassignedAfterUmls11.xlsx')\n",
+    "GoldStandard = pd.read_excel('01_Text_wrangling_files/GoldStandard_master.xlsx')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fuzzywuzzy import fuzz, process\n",
+    "\n",
+    "# Recommendation: Test first\n",
+    "# fuzzySourceZ = listOfUniqueUnassignedAfterGS.iloc[0:25]\n",
+    "\n",
+    "# 2018-07-08: Created FuzzyWuzzyProcResult1, 3,000 records, in 24 minutes\n",
+    "# 2018-07-09: 5,000 in 39 minutes\n",
+    "# 2018-07-09: 4,000 in 32 minutes\n",
+    "fuzzySourceZ = listOfUniqueUnassignedAfterUmls2.iloc[0:4000]\n",
+    "\n",
+    "'''\n",
+    "fuzzySource1 = listOfUniqueUnassignedAfterGS.iloc[0:5000]\n",
+    "fuzzySource2 = listOfUniqueUnassignedAfterGS.iloc[5001:10678]\n",
+    "'''\n",
+    "\n",
+    "def fuzzy_match(x, choices, scorer, cutoff):\n",
+    "    return process.extractOne(\n",
+    "        x, 
choices=choices, scorer=scorer, score_cutoff=cutoff\n", + " )\n", + "\n", + "# Create series FuzzyWuzzyResults\n", + "FuzzyWuzzyProcResult1 = fuzzySourceZ.loc[:, 'adjustedQueryCase'].apply(\n", + " fuzzy_match,\n", + " args=( GoldStandard.loc[:, 'adjustedQueryCase'],\n", + " fuzz.token_set_ratio,\n", + " 95\n", + " )\n", + ")\n", + "\n", + "# Convert FuzzyWuzzyResults Series to df\n", + "FuzzyWuzzyProcResult2 = pd.DataFrame(FuzzyWuzzyProcResult1)\n", + "\n", + "# Move Index (IDs) into 'FuzzyIndex' col because Index values will be discarded\n", + "FuzzyWuzzyProcResult2 = FuzzyWuzzyProcResult2.reset_index()\n", + "FuzzyWuzzyProcResult2 = FuzzyWuzzyProcResult2.rename(columns={'index': 'FuzzyIndex'})\n", + "\n", + "# Remove nulls\n", + "FuzzyWuzzyProcResult2 = FuzzyWuzzyProcResult2[FuzzyWuzzyProcResult2.adjustedQueryCase.notnull() == True] # remove nulls\n", + "\n", + "# Move tuple output into 3 cols\n", + "FuzzyWuzzyProcResult2[['FuzzyToken', 'FuzzyScore', 'GoldStandardIndex']] = FuzzyWuzzyProcResult2['adjustedQueryCase'].apply(pd.Series)\n", + "FuzzyWuzzyProcResult2.drop(['adjustedQueryCase'], axis=1, inplace=True) # drop tuples\n", + "\n", + "# Merge result to the orig source list cols\n", + "FuzzyWuzzyProcResult3 = pd.merge(FuzzyWuzzyProcResult2, fuzzySourceZ, how='left', left_index=True, right_index=True)\n", + "\n", + "# Change col order for browsability if you want to analyze this by itself\n", + "FuzzyWuzzyProcResult3 = FuzzyWuzzyProcResult3[['adjustedQueryCase', 'FuzzyToken', 'FuzzyScore', 'timesSearched', 'FuzzyIndex', 'GoldStandardIndex']]\n", + "\n", + "# Merge result to GoldStandard supplemental info\n", + "# Don't have a second person altering GoldStandard during your work...\n", + "FuzzyWuzzyProcResult4 = pd.merge(FuzzyWuzzyProcResult3, GoldStandard, how='left', left_on='GoldStandardIndex', right_index=True)\n", + "\n", + "# Reduce and rename\n", + "FuzzyWuzzyProcResult4 = FuzzyWuzzyProcResult4[['adjustedQueryCase_x', 'preferredTerm', 'FuzzyToken', 
'SemanticTypeName', 'SemanticGroup', 'timesSearched', 'FuzzyScore']]\n",
+    "FuzzyWuzzyProcResult4 = FuzzyWuzzyProcResult4.rename(columns={'adjustedQueryCase_x': 'adjustedQueryCase'})\n",
+    "\n",
+    "# Change name to be sensical inside other procedures\n",
+    "FuzzyWuzzyRawRecommendations = FuzzyWuzzyProcResult4\n",
+    "\n",
+    "\n",
+    "# Save to file so you can open in future sessions, if needed\n",
+    "writer = pd.ExcelWriter(localDir + 'FuzzyWuzzyRawRecommendations.xlsx')\n",
+    "FuzzyWuzzyRawRecommendations.to_excel(writer,'FuzzyWuzzyRawRecommendations')\n",
+    "# df2.to_excel(writer,'Sheet2')\n",
+    "writer.save()\n",
+    "\n",
+    "\n",
+    "# Future: chart for percent of total that were assigned something...\n",
+    "\n",
+    "# Remove fuzzySource1, etc., FuzzyWuzzyProcResult1, FuzzyWuzzyProcResult2, etc.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 3. Add result to MySQL, process at http://localhost:5000/fuzzy/\n",
+    "# ========================================================================\n",
+    "\n",
+    "# Requires manual_assignments table, see 06_Load_database.\n",
+    "\n",
+    "# Add dataframe to MySQL\n",
+    "\n",
+    "import pandas as pd\n",
+    "import mysql.connector\n",
+    "from pandas.io import sql\n",
+    "from sqlalchemy import create_engine\n",
+    "\n",
+    "dbconn = create_engine('mysql+mysqlconnector://wendlingd:pwd@localhost/ia')\n",
+    "\n",
+    "FuzzyWuzzyRawRecommendations.to_sql(name='manual_assignments', con=dbconn, if_exists = 'replace', index=False) # or if_exists='append'\n",
+    " \n",
+    "\n",
+    "'''\n",
+    "From MySQL command line:\n",
+    "LOAD DATA LOCAL INFILE '/Users/wendlingd/Downloads/FuzzyWuzzyRawRecommendations.csv' INTO TABLE manual_assignments FIELDS TERMINATED BY ',' (adjustedQueryCase, preferredTerm, FuzzyToken, SemanticTypeName, SemanticGroup, timesSearched, FuzzyScore);\n",
+    "\n",
+    "ALTER TABLE `manual_assignments` ADD `NewSemanticTypeName` VARCHAR(100) NULL AFTER 
`adjustedQueryCase`;\n",
+    "\n",
+    "Re-start:\n",
+    "FuzzyWuzzyRawRecommendations = pd.read_excel(localDir + 'FuzzyWuzzyRawRecommendations.xlsx')\n",
+    "\n",
+    "\n",
+    "select NewSemanticTypeName, count(*) as cnt\n",
+    "from manual_assignments\n",
+    "group by NewSemanticTypeName\n",
+    "order by cnt DESC;\n",
+    "\n",
+    "select count(*) cnt\n",
+    "from manual_assignments\n",
+    "WHERE NewSemanticTypeName IS NOT NULL\n",
+    "\n",
+    "\n",
+    "\n",
+    "FuzzyWuzzyRawRecommendations = pd.read_excel(localDir + 'FuzzyWuzzyRawRecommendations.xlsx')\n",
+    "\n",
+    "FuzzyWuzzyRawRecommendations.to_csv(localDir + 'FuzzyWuzzyRawRecommendations.csv', index=False, header=None)\n",
+    "\n",
+    "'''\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "'''\n",
+    "Resolve null column issues...\n",
+    "- No nulls\n",
+    "- Look right? Consistent, etc. \n",
+    "\n",
+    "Get SemanticTypeName for terms with new preferredTerm\n",
+    "\n",
+    "SELECT preferredTerm, NewSemanticTypeName, SemanticGroup\n",
+    "FROM manual_assignments\n",
+    "WHERE NewSemanticTypeName IS NULL\n",
+    "ORDER BY preferredTerm\n",
+    "\n",
+    "\n",
+    "When NewSemanticTypeName is null\n",
+    "\n",
+    "UPDATE manual_assignments\n",
+    "SET NewSemanticTypeName = SemanticTypeName\n",
+    "WHERE NewSemanticTypeName IS NULL\n",
+    "\n",
+    "\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 4. 
Bring data from manual_assignments back into Pandas\n", + "# ========================================================================\n", + "'''\n", + "Assign SemanticGroup from GoldStandard or other.\n", + "\n", + "'''\n", + "\n", + "\n", + "from sqlalchemy import create_engine\n", + "\n", + "dbconn = create_engine('mysql+mysqlconnector://wendlingd:DataSciPwr17@localhost/ia')\n", + "\n", + "\n", + "# Extract from MySQL to df\n", + "FuzAssigned = pd.read_sql('SELECT adjustedQueryCase, preferredTerm, NewSemanticTypeName FROM manual_assignments WHERE NewSemanticTypeName IS NOT NULL AND NewSemanticTypeName NOT LIKE \"Ignore\"', con=dbconn)\n", + "\n", + "\n", + "\n", + "# Write this to file (assuming multiple cycles)\n", + "writer = pd.ExcelWriter(localDir + 'FuzAssigned_BackFromMysql.xlsx')\n", + "FuzAssigned.to_excel(writer,'FuzAssigned')\n", + "writer.save()\n", + "\n", + "\n", + "# update SemanticGroup from GoldStandard_master\n", + "\n", + "gsUnique = GoldStandard[['preferredTerm', 'SemanticTypeName', 'SemanticGroup', 'SemanticGroupCode', 'BranchPosition', 'CustomTreeNumber']]\n", + "\n", + "gsUnique = gsUnique.drop_duplicates()\n", + "FuzAssigned2 = pd.merge(FuzAssigned, gsUnique, how='inner', on='preferredTerm')\n", + "\n", + "# Not sure why NewSemanticTypeName and SemanticTypeName are the same.\n", + "FuzAssigned2.drop(['NewSemanticTypeName'], axis=1, inplace=True)\n", + "\n", + "# Append to GoldStandard_master\n", + "GoldStandard = GoldStandard.append(FuzAssigned2, sort=True)\n", + "\n", + "# Write new GoldStandard\n", + "writer = pd.ExcelWriter('01_Text_wrangling_files/GoldStandard_master.xlsx')\n", + "GoldStandard.to_excel(writer,'GoldStandard')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 5. 
Update log and GoldStandard with new matches from MySQL\n", + "# ========================================================================\n", + "'''\n", + "Move clean-up work into browser.\n", + "\n", + "delete from manual_assignments\n", + "where NewSemanticTypeName like 'Ignore'\n", + "\n", + "Re-start:\n", + "logAfterUmlsApi1 = pd.read_excel('02_Run_APIs_files/logAfterUmlsApi1.xlsx')\n", + "'''\n", + "\n", + "logAfterUmlsApi1 = pd.read_excel('02_Run_APIs_files/logAfterUmlsApi1.xlsx')\n", + "\n", + "\n", + "# Apply to log file\n", + "logAfterFuzzyMatch = pd.merge(logAfterUmlsApi1, FuzAssigned2, how='left', on='adjustedQueryCase')\n", + "\n", + "# Future: Look for a better way to do the above - MERGE WITH CONDITIONAL OVERWRITE. Temporary fix:\n", + "logAfterFuzzyMatch['preferredTerm2'] = logAfterFuzzyMatch['preferredTerm_x'].where(logAfterFuzzyMatch['preferredTerm_x'].notnull(), logAfterFuzzyMatch['preferredTerm_y'])\n", + "logAfterFuzzyMatch['SemanticTypeName2'] = logAfterFuzzyMatch['SemanticTypeName_x'].where(logAfterFuzzyMatch['SemanticTypeName_x'].notnull(), logAfterFuzzyMatch['SemanticTypeName_y'])\n", + "logAfterFuzzyMatch['SemanticGroupCode2'] = logAfterFuzzyMatch['SemanticGroupCode_x'].where(logAfterFuzzyMatch['SemanticGroupCode_x'].notnull(), logAfterFuzzyMatch['SemanticGroupCode_y'])\n", + "logAfterFuzzyMatch['SemanticGroup2'] = logAfterFuzzyMatch['SemanticGroup_x'].where(logAfterFuzzyMatch['SemanticGroup_x'].notnull(), logAfterFuzzyMatch['SemanticGroup_y'])\n", + "logAfterFuzzyMatch['BranchPosition2'] = logAfterFuzzyMatch['BranchPosition_x'].where(logAfterFuzzyMatch['BranchPosition_x'].notnull(), logAfterFuzzyMatch['BranchPosition_y'])\n", + "logAfterFuzzyMatch['CustomTreeNumber2'] = logAfterFuzzyMatch['CustomTreeNumber_x'].where(logAfterFuzzyMatch['CustomTreeNumber_x'].notnull(), logAfterFuzzyMatch['CustomTreeNumber_y'])\n", + "logAfterFuzzyMatch.drop(['preferredTerm_x', 'preferredTerm_y',\n", + " 'SemanticTypeName_x', 'SemanticTypeName_y',\n", + " 
'SemanticGroup_x', 'SemanticGroup_y',\n", + " 'SemanticGroupCode_x', 'SemanticGroupCode_y',\n", + " 'BranchPosition_x', 'BranchPosition_y', \n", + " 'CustomTreeNumber_x', 'CustomTreeNumber_y'], axis=1, inplace=True)\n", + "logAfterFuzzyMatch.rename(columns={'preferredTerm2': 'preferredTerm',\n", + " 'SemanticTypeName2': 'SemanticTypeName',\n", + " 'SemanticGroup2': 'SemanticGroup',\n", + " 'SemanticGroupCode2': 'SemanticGroupCode',\n", + " 'BranchPosition2': 'BranchPosition',\n", + " 'CustomTreeNumber2': 'CustomTreeNumber'\n", + " }, inplace=True)\n", + "\n", + "\n", + "# FIXME - Why are duplicate rows introduced?\n", + "logAfterFuzzyMatch = logAfterFuzzyMatch.drop_duplicates()\n", + "\n", + "\n", + "# Save to file so you can open in future sessions, if needed\n", + "writer = pd.ExcelWriter(localDir + 'logAfterFuzzyMatch.xlsx')\n", + "logAfterFuzzyMatch.to_excel(writer,'logAfterFuzzyMatch')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "\n", + "# ---------------------------------------\n", + "# Visualize results - logAfterFuzzyMatch\n", + "# ---------------------------------------\n", + " \n", + "# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/\n", + "totCount = len(logAfterFuzzyMatch)\n", + "unassigned = logAfterFuzzyMatch['SemanticGroup'].isnull().sum()\n", + "# unassigned = logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.contains('Unparsed') == True]\n", + "assigned = totCount - unassigned\n", + "labels = ['Assigned', 'Unassigned']\n", + "sizes = [assigned, unassigned]\n", + "colors = ['lightskyblue', 'lightcoral']\n", + "explode = (0.1, 0) # explode 1st slice\n", + "plt.pie(sizes, explode=explode, labels=labels, colors=colors,\n", + " autopct='%1.f%%', shadow=True, startangle=100)\n", + "plt.axis('equal')\n", + "plt.title(\"Status after 'fuzzy match' processing\")\n", + "plt.show()\n", + "\n", + "\n", + "# Bar of SemanticGroup categories, horizontal\n", + "# Source: 
http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html\n",
+    "ax = logAfterFuzzyMatch['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),\n",
+    "                                                 color=\"slateblue\", fontsize=10);\n",
+    "ax.set_alpha(0.8)\n",
+    "ax.set_title(\"Categories assigned after 'fuzzy match' processing\", fontsize=14)\n",
+    "ax.set_xlabel(\"Number of searches\", fontsize=9);\n",
+    "# set individual bar labels using above list\n",
+    "for i in ax.patches:\n",
+    "    # get_width pulls left or right; get_y pushes up or down\n",
+    "    ax.text(i.get_width()+.1, i.get_y()+.31, \\\n",
+    "            str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')\n",
+    "# invert for largest on top \n",
+    "ax.invert_yaxis()\n",
+    "plt.gcf().subplots_adjust(left=0.3)\n",
+    "\n",
+    "\n",
+    "\n",
+    "\n",
+    "'''\n",
+    "# Unite data, if there are multiple output files\n",
+    "f1 = pd.read_excel(localDir + 'FuzAssigned_Dan1.xlsx')\n",
+    "f2 = pd.read_excel(localDir + 'FuzAssigned_Dan2.xlsx')\n",
+    "f3 = pd.read_excel(localDir + 'FuzAssigned_Dan3.xlsx')\n",
+    "\n",
+    "# Concat\n",
+    "fmAdd1 = pd.concat([f1, f2, f3], ignore_index=True, sort=True)\n",
+    "\n",
+    "# drop SemanticTypeName (if present)\n",
+    "fmAdd1.drop(['SemanticTypeName'], axis=1, inplace=True)\n",
+    "\n",
+    "# Rename SemanticTypeName\n",
+    "fmAdd1 = fmAdd1.rename(columns={'NewSemanticTypeName': 'SemanticTypeName'})\n",
+    "\n",
+    "\n",
+    "# De-dupe. Future? July run, need to eyeball before deleting\n",
+    "# fmAdd1.drop_duplicates(subset=['A', 'C'], keep=False)\n",
+    "\n",
+    "searchLog.head(n=5)\n",
+    "searchLog.shape\n",
+    "searchLog.info()\n",
+    "searchLog.columns\n",
+    "\n",
+    "# Remove f1, etc.\n",
+    "'''\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 6. Create new 'uniques' dataframe/file for ML\n",
+    "# =========================================================================\n",
+    "'''\n",
+    "Won't require Excel file if you run the df from here. 
File will have \n",
+    "updated entries from this session, by pulling back GoldStandard. Also will\n",
+    "make sure that previously found preferredTerm will be available as if they \n",
+    "were queries, to get maximum utility from the API.\n",
+    "\n",
+    "Training file: ApiAssignedSearches.xlsx (successful matches)\n",
+    "Unmatched terms we want to predict for: search-seed_the_ML.xlsx\n",
+    "\n",
+    "GoldStandard = pd.read_excel('01_Text_wrangling_files/GoldStandard_master.xlsx')\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "# Base on UPDATED (above) GoldStandard\n",
+    "ApiAssignedSearches = GoldStandard\n",
+    "\n",
+    "col = ['adjustedQueryCase', 'preferredTerm', 'SemanticTypeName']\n",
+    "ApiAssignedSearches = ApiAssignedSearches[col]\n",
+    "\n",
+    "'''\n",
+    "get all preferredTerm items, dupe this into adjustedQueryCase column (so both \n",
+    "columns are the same, i.e., preferredTerm is also available as if it were \n",
+    "raw input); append to df, de-dupe rows.\n",
+    "'''\n",
+    "\n",
+    "prefGrabber = ApiAssignedSearches.drop(['adjustedQueryCase'], axis=1) # drop col\n",
+    "prefGrabber.drop_duplicates(inplace=True) # de-dupe rows\n",
+    "prefGrabber['adjustedQueryCase'] = prefGrabber['preferredTerm'].str.lower() # dupe and lc\n",
+    "\n",
+    "ApiAssignedSearches = ApiAssignedSearches.append(prefGrabber, sort=True) # append to orig\n",
+    "ApiAssignedSearches.drop_duplicates(inplace=True) # de-dupe rows after append\n",
+    "\n",
+    "# FIXME - Some adjustedQueryCase = nan\n",
+    "ApiAssignedSearches.adjustedQueryCase.fillna(ApiAssignedSearches.preferredTerm, inplace=True)\n",
+    "ApiAssignedSearches['adjustedQueryCase'] = ApiAssignedSearches['adjustedQueryCase'].str.lower() # str.lower the nan fixes\n",
+    "\n",
+    "# Write this to file\n",
+    "writer = pd.ExcelWriter(localDir + 'ApiAssignedSearches.xlsx')\n",
+    "ApiAssignedSearches.to_excel(writer,'ApiAssignedSearches')\n",
+    "writer.save()\n",
+    "\n",
+    "\n",
+    "# REMOVE\n",
+    "# Most variables but NOT ApiAssignedSearches"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 
null, + "metadata": {}, + "outputs": [], + "source": [ + "# 7. Next steps\n", + "# ==============\n", + "'''\n", + "Open 03_ML-classification.py, run the machine learning routines. You will use\n", + "these Excel files or dataframes\n", + "\n", + "- ApiAssignedSearches\n", + "- unassignedAfterUmls1 or unassignedAfterUmls2\n", + "\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/04_Machine_learning_classification.ipynb b/04_Machine_learning_classification.ipynb new file mode 100644 index 0000000..bf7d3c1 --- /dev/null +++ b/04_Machine_learning_classification.ipynb @@ -0,0 +1,906 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 4. Machine Learning Classification\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** scikit-learn for site-search classifications
\n", + "Authors: dan.wendling@nih.gov,
\n",
+    "Last modified: 2018-09-09\n",
+    "\n",
+    "\n",
+    "## THIS SCRIPT (A WORK IN PROGRESS)\n",
+    "\n",
+    "Some rough, not-quite in order, partly not-functioning machine learning \n",
+    "code that will eventually result in a dataframe of classification choices \n",
+    "for the ~30% of visitor search queries that the UMLS Metathesaurus is not \n",
+    "able to classify into broader categories (see 01_Pre-processing.py). \n",
+    "Some entries will be automatically assignable if misspellings can be \n",
+    "overcome with confidence.\n",
+    "\n",
+    "Desired end result in the future (feature name / column name is first row):\n",
+    "\n",
+    "| adjustedQueryCase | preferredTerm | SemanticType | SemanticGroup |\n",
+    "| --- | --- | --- | --- |\n",
+    "| gallbladder cancer | Malignant neoplasm of gallbladder | Neoplastic Process | Disorders |\n",
+    "\n",
+    "\n",
+    "What this script is trying to generate currently:\n",
+    " \n",
+    "| adjustedQueryCase | UmlsApproximate | pred-LinearSVC | pred-LogisticRegression | pred-NaiveBayesMultinomial |\n",
+    "| --- | --- | --- | --- | --- |\n",
+    "| cbd oil | nan | Organic Chemical | Intellectual Product | Organic Chemical |\n",
+    "\n",
+    "(cbd oil is a marijuana-based product that many visitors seem to be \n",
+    "interested in. Variations on cbd will be available in the training set. \n",
+    "An alternative to this work is clustering, but would it cluster correctly...)\n",
+    "\n",
+    "\n",
+    "Feature/column explanations:\n",
+    " \n",
+    "adjustedQueryCase - Lowercased version of query with most punctuation removed.\n",
+    "preferredTerm - Can be assigned later if needed, but these are the UMLS system\n",
+    "    picks from more than 200 medical vocabularies. 01_Pre-processing.py will \n",
+    "    assign when possible; this script does not use them, as described below.\n",
+    "    The UMLS system has 10s of thousands of these or more.\n",
+    "SemanticType - Select one of 130 ontology categories. 
 Eventually I will need\n",
+ " to select two or more of these categories for searches that look like \n",
+ " \"cancer exercise.\" For now I am okay capturing one category.\n",
+ "SemanticGroup - One of 15 ontology super-categories.\n",
+ "UmlsApproximate - A new run of the UMLS API that will be set to more liberal\n",
+ "matching.\n",
+ "pred-LinearSVC, etc. - Predictions from the various models. These can be used for eyeballing\n",
+ "entries and adding manual assignments, to get the under-represented classes \n",
+ "some more content for future matching.\n",
+ "\n",
+ "\n",
+ "## Script contents\n",
+ "\n",
+ "1. Start-up / What to put into place, where; dataframe mods\n",
+ "2. Eyeball level of balance among classes under study, in training set\n",
+ "3. Training set: Calculate tf-idf vector for each query\n",
+ "4. Training set, Chi square: Find the terms most correlated with each item\n",
+ "5. Train, test, predict with a multi-class classifier, Naive Bayes-Multinomial\n",
+ "6. Model selection - Which among four models is the BEST model for this dataset?\n",
+ "\n",
+ "Aspirational, not working\n",
+ "7. Look deeper, with a confusion matrix, into the most successful model of\n",
+ " our group, LinearSVC (Linear Support Vector Classification)\n",
+ "8. Understand the misclassifications. Should we change the model, or not?\n",
+ "\n",
+ "Somewhat working\n",
+ "9. Chi-square to find terms MOST CORRELATED with each category\n",
+ "10. TryLinearSVCdf - Unmatched terms with LinearSVC\n",
+ "11. TryLogisticRegressiondf - Unmatched terms with LogisticRegression\n",
+ "\n",
+ "Not working\n",
+ "12. Final report by category\n",
+ "\n",
+ "\n",
+ "## FIXMEs\n",
+ "\n",
+ "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n",
+ "\n",
+ "* [ ] \n",
+ "\n",
+ "- Biggest problem: I have under-represented classes, such as for NLM\n",
+ " products, which we are building manually. <br>
See file search-seed_the_ML.xlsx - \n", + " these are not matchable to the UMLS API as currently configured (it is \n", + " configured for high-confidence matches). We're working now to improve \n", + " category prediction for things not found in the UMLS datasets, such as \n", + " misspellings, NLM products and services, partial NLM Web page titles \n", + " (I scrape the site so I have a file of these, but they are verbose), \n", + " historical names, commercial product names, etc. These will be added \n", + " to the \"GoldStandard\" file. We started with the highest-frequency \n", + " unmatched; hopefully ML can take over some or most of this. Clustering \n", + " and FuzzyWuzzy will probably help here.\n", + " \n", + "- Dan will add second and third runs for the UMLS API, as described in \n", + " 01_Pre-processing.py, to resolve non-English queries and provide a feature\n", + " (column) of UMLS API guesses, whose prediction scores were too low to \n", + " return in the \"normalized string\" procedure I am using in the single \n", + " current UMLS API run. Then an editor can perhaps choose among the UMLS, \n", + " LinearSVC, LogisticRegression, etc., predictions.\n", + " \n", + "- For the ML code below, I am trying to assign from the ~130 \n", + " *SemanticTypeName* categories (see file 01_Pre-processing_files/\n", + " SemanticNetworkReference.xlsx). 
 I think using *SemanticTypeName* is \n",
+ " best for the project; we could also try to match to the 15 \n",
+ " super-categories and then create more routines to match to the 130 \n",
+ " sub-categories.\n",
+ " \n",
+ "- Add FuzzyWuzzy, perhaps fix misspellings in place (in col adjustedQueryCase),\n",
+ " if confidence is high that it will be the right fix...\n",
+ "- Add stemming, lemmatization?\n",
+ "- Future: Add the ability to assign one query to multiple categories.\n",
+ "- More FIXMEs may appear in context below.\n",
+ "\n",
+ "\n",
+ "## INFLUENCES\n",
+ "\n",
+ "- Susan Li, https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f\n",
+ " (This code came from her code; I don't know what all of it does. Some of her wording is still in the comments.)\n",
+ "- Andreas Mueller, https://github.com/amueller/introduction_to_ml_with_python\n",
+ " (I am looking to add procedures from here that will assist in manual\n",
+ " assignments. Not sure what to add; LDA-based charts look useful.)\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. 
Start-up / What to put into place, where; dataframe mods\n", + "\n", + "Training file: ApiAssignedSearches.xlsx (successful matches)\n", + "Unmatched terms we want to predict for: search-seed_the_ML.xlsx\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import os\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/Projects/webDS/_util')\n", + "\n", + "\n", + "'''\n", + "Bring in and adjust training file, ApiAssignedSearches.xlsx - previous \n", + "successful assignments; we will need adjustedQueryCase, preferredTerm, \n", + "SemanticTypeName, SemanticGroup...\n", + "'''\n", + "\n", + "df = pd.read_excel('02_UMLS_API_files/ApiAssignedSearches.xlsx')\n", + "\n", + "\n", + "# OR...\n", + "# df = ApiAssignedSearches\n", + "\n", + "\n", + "# Don't use preferredTerm for now - will be too inaccurate\n", + "df = df.drop(['preferredTerm'], axis=1) # drop col\n", + "\n", + "# Don't try to process any non-Roman characters; eyeball and remove\n", + "# df = df[17:]\n", + "\n", + "df.info()\n", + "# df.head()\n", + "# df.columns\n", + "\n", + "# 6/23: Trouble with fit, so trying this - remove integer data type, perhaps\n", + "# Remove int values in adjustedQueryCase by removal or coerced data type change\n", + "df['adjustedQueryCase'] = df['adjustedQueryCase'].astype(str)\n", + "\n", + "\n", + "df = df.sort_values(by='adjustedQueryCase', ascending=True)\n", + "df = df.reset_index()\n", + "df.drop(['index'], axis=1, inplace=True)\n", + "\n", + "'''\n", + "df.drop(12038, inplace=True)\n", + "df.drop(10714, inplace=True)\n", + "df.drop(6822, inplace=True)\n", + "'''\n", + "df.drop(26905, inplace=True)\n", + "df.drop(26904, inplace=True)\n", + "df.drop(26903, inplace=True)\n", + "\n", + "'''\n", + "To preserve changes to training file for future 
sessions\n", + "\n", + "# Useful to write out the cleaned up version; if you do re-processing, you can skip a bunch of work.\n", + "writer = pd.ExcelWriter('01_Pre-processing_files/ApiAssignedSearches.xlsx')\n", + "df.to_excel(writer,'ApiAssignedSearches')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "'''\n", + "\n", + "# add a column encoding the product as an integer, because categorical \n", + "# variables are often better represented by integers than strings\n", + "df['category_id'] = df['SemanticTypeName'].factorize()[0]\n", + "\n", + "# create a couple of dictionaries for future use\n", + "category_id_df = df[['SemanticTypeName', \n", + " 'category_id']].drop_duplicates().sort_values('category_id')\n", + "category_to_id = dict(category_id_df.values)\n", + "id_to_category = dict(category_id_df[['category_id', 'SemanticTypeName']].values)\n", + "\n", + "# what the first rows look like after the mods\n", + "df.head()\n", + "\n", + "# Bring in entries to match\n", + "unassignedAfterUmls1 = pd.read_excel('02_UMLS_API_files/unassignedAfterUmls1.xlsx')\n", + "unassignedAfterUmls1 = unassignedAfterUmls1.drop(['timesSearched'], axis=1) # drop col\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "'''\n", + "*** I'M STUCK ON THIS ONE (FILE BELOW) ***\n", + "\n", + "Not sure of your definition of fuzzy matching, but I use it to describe \n", + "misspellings and also this - verbose product, service, etc. names.\n", + "\n", + "Advice on how this should be implemented would be very useful!\n", + "\n", + "People often look for words within web pages - librarians call this a \n", + "\"known item search,\" for a web page/product page/service page they are \n", + "trying to get to. Many person names are in our biography pages. Product,\n", + "service names, etc. 
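One possible answer to the fuzzy-matching question above — matching short queries such as "kenneth walker" against verbose page titles — is token-based scoring rather than whole-string matching, so that title length stops penalizing the score. This is only a sketch using the standard library's difflib; the `page_titles` list and the `best_title_match` helper are hypothetical illustrations, not part of this project's files, and a library such as FuzzyWuzzy's token_set_ratio would do roughly the same job.

```python
from difflib import SequenceMatcher

def token_score(query, title):
    """Average, over query tokens, of each token's best similarity to any
    title token. Verbose titles are not penalized, because only the query's
    tokens need to find a match."""
    q_tokens = query.lower().split()
    t_tokens = title.lower().split()
    if not q_tokens or not t_tokens:
        return 0.0
    per_token = [max(SequenceMatcher(None, q, t).ratio() for t in t_tokens)
                 for q in q_tokens]
    return sum(per_token) / len(per_token)

def best_title_match(query, titles, threshold=0.85):
    """Return (title, score) for the best-scoring title, or None if the
    best score is below the threshold."""
    title, score = max(((t, token_score(query, t)) for t in titles),
                       key=lambda pair: pair[1])
    return (title, score) if score >= threshold else None

# Hypothetical scraped page titles (e.g., from an SEO Spider export)
page_titles = [
    "NLM Mourns the Loss of H. Kenneth Walker, MD, MACP, FAAN, "
    "Former Chair of the National Library of Medicine Board of Regents",
    "MedlinePlus: Trusted Health Information for You",
]

match = best_title_match("kenneth walker", page_titles)
```

The threshold would need tuning against real logs; name extraction could still be layered on top for person-name queries.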
 Multiple searches for \"Kenneth Walker\" - people\n",
+ "are probably trying to get to https://www.nlm.nih.gov/news/nlm_mourns_ken_walker.html,\n",
+ "or that's where we want them to get to. \n",
+ "\n",
+ "I use SEO Spider to scrape web page content, including page titles; in this \n",
+ "case they probably are trying to get to the page titled \"NLM Mourns the Loss \n",
+ "of H. Kenneth Walker, MD, MACP, FAAN, Former Chair of the National \n",
+ "Library of Medicine Board of Regents.\"\n",
+ "\n",
+ "Do I have to vectorize this whole title? How should the matches between\n",
+ "visitor search terms and verbose page titles be implemented? Name \n",
+ "extraction first?\n",
+ "\n",
+ "I also have to do this with the list of 200+ named NLM products (mostly\n",
+ "databases such as pubmed.gov).\n",
+ "'''\n",
+ "\n",
+ "# This data needs to be used to fuzzy match against web page names.\n",
+ "ShouldBeFuzzyMatched = pd.read_excel('03_ML-classification_files/ShouldBeFuzzyMatched.xlsx')\n",
+ "\n",
+ "ShouldBeFuzzyMatched.head(n=10)\n",
+ "\n",
+ "'''\n",
+ "ShouldBeFuzzyMatched is not used in this script; please suggest methods. \n",
+ "For the above example I would like eventually to end up with:\n",
+ "\n",
+ "| adjustedQueryCase | preferredTerm | SemanticType | SemanticGroup |\n",
+ "| kenneth walker | NLM News 2018 | NLM Web Page | Intellectual Products |\n",
+ "\n",
+ "\n",
+ "FYI, preferredTerm is not part of this script currently; I am dropping \n",
+ "that column above. But the value of preferredTerm would be \n",
+ "ShouldBeFuzzyMatched['ContentGroup'] when a match is made to the page title.\n",
+ "Regarding SemanticType, to the original ontology of ~130 types I have added \n",
+ "several NLM-specific type names such as \"NLM Web Page.\" Not many in the\n",
+ "training set yet.\n",
+ "'''\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 2. 
 Eyeball level of balance among classes under study, in training set\n",
+ "# Less than 1 minute\n",
+ "# =======================================================================\n",
+ "'''\n",
+ "Perhaps this should be used with item 12 below, Final report, after that\n",
+ "has been run.\n",
+ "'''\n",
+ "\n",
+ "fig = plt.figure(figsize=(10,20))\n",
+ "df.groupby('SemanticTypeName').adjustedQueryCase.count().sort_index(ascending=False).plot.barh(ylim=0, fontsize=6, color=\"slateblue\")\n",
+ "fig.subplots_adjust(left=0.3)\n",
+ "plt.title(\"Eyeball level of balance among classes\", fontsize=12)\n",
+ "plt.xlabel(\"Number of queries\")\n",
+ "plt.show()\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "(Wow, lots of variation. Many under-represented classes.)\n",
+ "'''\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 3. Training set: Calculate tf-idf vector for each query\n",
+ "# Less than 1 minute\n",
+ "# ========================================================\n",
+ "'''\n",
+ "Current classifiers and learning algorithms cannot directly process \n",
+ "text in its original form; most of them expect numerical feature vectors \n",
+ "with a fixed size, rather than raw text of variable length. Therefore, \n",
+ "in this preprocessing step, the query text will be converted to a more manageable \n",
+ "representation.\n",
+ "\n",
+ "One common approach for extracting features from text is to use the \n",
+ "\"bag of words\" model, where for each document, a search query \n",
+ "in our case, the presence (and often the frequency) of words is taken \n",
+ "into consideration, but the order in which they occur is ignored.\n",
+ "\n",
+ "Specifically, for each term in our dataset, we will calculate a measure \n",
+ "called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf. 
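To make the tf-idf step concrete, here is a minimal sketch on a few toy queries (the query strings are invented for illustration). It mirrors the TfidfVectorizer settings used in this script, except that min_df is lowered to 1 so the tiny corpus still produces features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example queries, standing in for df.adjustedQueryCase
toy_queries = [
    "gallbladder cancer",
    "cancer exercise",
    "intermittent fasting",
    "gallbladder surgery",
]

# Same idea as the real run: unigrams + bigrams, sublinear tf, L2 norm;
# min_df=1 only because this toy corpus is tiny (the real run uses min_df=5).
tfidf_demo = TfidfVectorizer(sublinear_tf=True, min_df=1, norm='l2',
                             ngram_range=(1, 2), stop_words='english')
X_demo = tfidf_demo.fit_transform(toy_queries)

# 6 unigrams + 4 bigrams -> a (4, 10) matrix, one L2-normalized row per query
print(X_demo.shape)
print(sorted(tfidf_demo.vocabulary_))
```

Terms shared across queries (e.g., "gallbladder") get lower idf weight than terms unique to one query, which is what lets the classifier weigh distinctive words more heavily.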
\n",
+ "\n",
+ "We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate \n",
+ "a tf-idf vector for each of our search queries.\n",
+ "\n",
+ "Cf. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
+ "'''\n",
+ "\n",
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+ "\n",
+ "tfidf = TfidfVectorizer(sublinear_tf=True, # True = Use a logarithmic form for frequency\n",
+ " min_df=5, # minimum number of documents a word must be present in to be kept\n",
+ " norm='l2', # to ensure all our feature vectors have a Euclidean norm of 1\n",
+ " encoding='latin-1', \n",
+ " ngram_range=(1, 2), # both unigrams and bigrams\n",
+ " stop_words='english') # remove common \"noise\" words, limit resulting features to useful ones\n",
+ "\n",
+ "features = tfidf.fit_transform(df.adjustedQueryCase).toarray()\n",
+ "labels = df.category_id\n",
+ "features.shape\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "Shape example, (29289, 75036)\n",
+ "\n",
+ "Now, each of 29289 queries is represented by 75036 features, representing \n",
+ "the tf-idf score for different unigrams and bigrams.\n",
+ "'''\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 4. Training set, Chi square: Find the terms most correlated with each item\n",
+ "# Less than 1 minute\n",
+ "# ===========================================================================\n",
+ "'''\n",
+ "We can use sklearn.feature_selection.chi2 to find the terms that are the \n",
+ "most correlated with each of the categories.\n",
+ "\n",
+ "Cf. 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html\n",
+ "'''\n",
+ "\n",
+ "from sklearn.feature_selection import chi2\n",
+ "\n",
+ "N = 2\n",
+ "for Product, category_id in sorted(category_to_id.items()):\n",
+ " features_chi2 = chi2(features, labels == category_id)\n",
+ " indices = np.argsort(features_chi2[0])\n",
+ " feature_names = np.array(tfidf.get_feature_names())[indices]\n",
+ " unigrams = [v for v in feature_names if len(v.split(' ')) == 1]\n",
+ " bigrams = [v for v in feature_names if len(v.split(' ')) == 2]\n",
+ " print(\"# '{}':\".format(Product))\n",
+ " print(\" . Most correlated unigrams:\\n . {}\".format('\\n . '.join(unigrams[-N:])))\n",
+ " print(\" . Most correlated bigrams:\\n . {}\".format('\\n . '.join(bigrams[-N:])))\n",
+ " "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 5. Train, test, predict with a multi-class classifier, Naive Bayes-Multinomial\n",
+ "# Less than 1 minute\n",
+ "# ===============================================================================\n",
+ "'''\n",
+ "To train supervised classifiers, we first transformed each search \n",
+ "query into a vector of numbers. 
 We explored vector \n",
+ "representations such as TF-IDF weighted vectors.\n",
+ "\n",
+ "After having a vector representation of the text, we can train \n",
+ "supervised-learning classifiers on it and then predict which of our \n",
+ "categories to assign to unseen queries.\n",
+ "\n",
+ "Here we will vectorize with CountVectorizer and transform with TfidfTransformer.\n",
+ "'''\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
+ "from sklearn.feature_extraction.text import TfidfTransformer\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(df['adjustedQueryCase'], df['SemanticTypeName'], random_state = 0)\n",
+ "count_vect = CountVectorizer()\n",
+ "X_train_counts = count_vect.fit_transform(X_train)\n",
+ "tfidf_transformer = TfidfTransformer()\n",
+ "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "Now that we have all the features and labels, we can start training the \n",
+ "classifier. There are a number of algorithms that might be useful for the \n",
+ "current dataset. Naive Bayes is a common go-to. The model most suitable \n",
+ "for word counts is the multinomial variant.\n",
+ "\n",
+ "Cf. 
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html\n", + "\n", + "'''\n", + "\n", + "from sklearn.naive_bayes import MultinomialNB\n", + "\n", + "clf = MultinomialNB().fit(X_train_tfidf, y_train)\n", + "\n", + "\n", + "# After fitting the training set, let’s try a few predictions.\n", + "\n", + "# Tests\n", + "print(clf.predict(count_vect.transform([\"herpes i\"])))\n", + "print(clf.predict(count_vect.transform([\"bemer\"])))\n", + "print(clf.predict(count_vect.transform([\"dental journals\"])))\n", + "print(clf.predict(count_vect.transform([\"intermittent fasting\"])))\n", + "print(clf.predict(count_vect.transform([\"cardiac tamponade pericardial lymphoma\"])))\n", + "print(clf.predict(count_vect.transform([\"fisioterapia\"])))\n", + "print(clf.predict(count_vect.transform([\"diabete\"])))\n", + "print(clf.predict(count_vect.transform([\"journal of clinical and diagnostic research\"])))\n", + "print(clf.predict(count_vect.transform([\"hippocrates\"])))\n", + "print(clf.predict(count_vect.transform([\"the new england journal of medicine\"])))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 6. Model selection - Which among four models is the BEST model for this dataset?\n", + "# ~1 minute\n", + "# REQUIRES AT LEAST 5 EXAMPLES PER CLASS\n", + "# =================================================================================\n", + "'''\n", + "Let's benchmark four models used for this type of dataset, evaluate their \n", + "accuracy, and visualize their classification accuracy for our dataset.\n", + "\n", + "1. (Multinomial) Naive Bayes, http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html\n", + "2. Logistic Regression, http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", + "3. 
Linear Support Vector Classification, http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html\n", + "4. Random Forest, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n", + "'''\n", + "\n", + "from sklearn.naive_bayes import MultinomialNB\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.svm import LinearSVC\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "models = [\n", + " MultinomialNB(),\n", + " LogisticRegression(random_state=0),\n", + " LinearSVC(),\n", + " RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),\n", + "]\n", + "CV = 5\n", + "cv_df = pd.DataFrame(index=range(CV * len(models)))\n", + "entries = []\n", + "for model in models:\n", + " model_name = model.__class__.__name__\n", + " accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)\n", + " for fold_idx, accuracy in enumerate(accuracies):\n", + " entries.append((model_name, fold_idx, accuracy))\n", + "cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])\n", + "\n", + "\n", + "import seaborn as sns\n", + "# Cf. https://seaborn.pydata.org/generated/seaborn.boxplot.html\n", + "\n", + "sns.boxplot(x='model_name', y='accuracy', data=cv_df).set_title(\"Classifier performance (box plot)\")\n", + "sns.stripplot(x='model_name', y='accuracy', data=cv_df, \n", + " size=8, jitter=True, edgecolor=\"gray\", linewidth=2)\n", + "plt.show()\n", + "\n", + "cv_df.groupby('model_name').accuracy.mean()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 7. 
 Look deeper, with a confusion matrix, into the most successful model of\n",
+ "# our group, LinearSVC (Linear Support Vector Classification)\n",
+ "# Less than 1 minute\n",
+ "# ========================================================================\n",
+ "'''\n",
+ "Too many categories to display well. In the future, could use SemanticGroup (only 15 classes)\n",
+ "\n",
+ "Continuing with LinearSVC, the most-accurate model of the ones we tested, \n",
+ "let's create a confusion matrix to show the discrepancies between predicted \n",
+ "and actual labels within the categories.\n",
+ "\n",
+ "Parameters: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html\n",
+ "'''\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "model = LinearSVC()\n",
+ "\n",
+ "X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(\n",
+ " features, labels, df.index, test_size=0.33, random_state=0)\n",
+ "model.fit(X_train, y_train)\n",
+ "y_pred = model.predict(X_test)\n",
+ "\n",
+ "from sklearn.metrics import confusion_matrix\n",
+ "\n",
+ "conf_mat = confusion_matrix(y_test, y_pred)\n",
+ "fig, ax = plt.subplots(figsize=(10,8))\n",
+ "sns.heatmap(conf_mat, annot=True, fmt='d',\n",
+ " xticklabels=category_id_df.SemanticTypeName.values, \n",
+ " yticklabels=category_id_df.SemanticTypeName.values)\n",
+ "plt.rcParams.update({'font.size': 8})\n",
+ "plt.ylabel('Actual')\n",
+ "plt.subplots_adjust(left=0.5, bottom=0.5)\n",
+ "plt.xlabel('Predicted')\n",
+ "plt.show()\n",
+ "\n",
+ "# The category names are long; shorten them to see the heatmap as\n",
+ "# intended, with readable Actual and Predicted tick labels.\n",
+ "\n",
+ "'''\n",
+ "The vast majority of the predictions end up on the diagonal (predicted \n",
+ "label = actual label), where we want them to be. 
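One way to read a confusion matrix numerically rather than visually: normalize each row so that the diagonal gives per-class recall. A small self-contained sketch, using a made-up 3-class matrix standing in for the real conf_mat:

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, cols = predicted),
# standing in for the much larger conf_mat computed above.
conf_mat_demo = np.array([[50,  2,  3],
                          [ 4, 20,  6],
                          [ 1,  0,  9]])

row_totals = conf_mat_demo.sum(axis=1, keepdims=True)
normalized = conf_mat_demo / row_totals        # each row now sums to 1
per_class_recall = np.diag(normalized)         # diagonal = recall per class

print(per_class_recall)  # approx. [0.909, 0.667, 0.9]
```

Sorting classes by this recall vector is a quick way to surface the categories the model handles worst, without squinting at a 100+ class heatmap.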
\n", + "\n", + "However, there are a number of misclassifications, and it might be \n", + "interesting to see what those are caused by.\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 8. Understand the misclassifications. Should we change the model, or not?\n", + "# Less than 1 minute\n", + "# ==========================================================================\n", + "'''\n", + "From Susan Li's blog post, not working with this dataset.\n", + "\n", + "Uses dictionary id_to_category.\n", + "'''\n", + "\n", + "from IPython.display import display\n", + "\n", + "for predicted in category_id_df.category_id:\n", + " for actual in category_id_df.category_id:\n", + " if predicted != actual and conf_mat[actual, predicted] >= 6:\n", + " print(\"'{}' predicted as '{}' : {} examples.\".format(\n", + " id_to_category[actual], id_to_category[predicted], \n", + " conf_mat[actual, predicted]))\n", + " display(df.loc[indices_test[(y_test == actual) & \n", + " (y_pred == predicted)]]\n", + " [['SemanticTypeName', 'adjustedQueryCase']])\n", + " print('')\n", + "\n", + "'''\n", + "When things belong in multiple categories, errors will happen; not directly\n", + "fixable.\n", + "'''\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 9. 
 Chi-square to find terms MOST CORRELATED with each category\n",
+ "# Less than 1 minute\n",
+ "# ========================================================================\n",
+ "'''\n",
+ "Again, we use the chi-squared test to find the terms that are the most \n",
+ "correlated with each of the categories.\n",
+ "\n",
+ "In IPython console:\n",
+ " Start recording print output: %logstart dan1.txt\n",
+ " Stop recording print output: %logstop\n",
+ "\n",
+ "Or don't specify a file name and then look for ipython_log.py\n",
+ "https://ipython.org/ipython-doc/3/interactive/reference.html\n",
+ "'''\n",
+ "\n",
+ "model.fit(features, labels)\n",
+ "\n",
+ "from sklearn.feature_selection import chi2\n",
+ "\n",
+ "N = 3\n",
+ "stringCapture = \"\"\n",
+ "for Product, category_id in sorted(category_to_id.items()):\n",
+ " indices = np.argsort(model.coef_[category_id])\n",
+ " feature_names = np.array(tfidf.get_feature_names())[indices]\n",
+ " unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]\n",
+ " bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]\n",
+ " print(\"# '{}':\".format(Product))\n",
+ " print(\" . Top unigrams:\\n . {}\".format('\\n . '.join(unigrams)))\n",
+ " print(\" . Top bigrams:\\n . {}\".format('\\n . '.join(bigrams)))\n",
+ " stringCapture += '\\n\\n' + str(Product) + '\\n Top unigrams:\\n ' + str(unigrams) + '\\n Top bigrams:\\n ' + str(bigrams)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 10. 
 TryLinearSVCdf - Unmatched terms with LinearSVC\n",
+ "# Less than 1 minute\n",
+ "# ========================================================================\n",
+ "\n",
+ "TryLinearSVCdf = pd.DataFrame()\n",
+ "TryLinearSVCdf['adjustedQueryCase'] = \"\"\n",
+ "TryLinearSVCdf['pred-LinearSVC'] = \"\"\n",
+ "\n",
+ "TryLinearSVC = unassignedAfterUmls1['adjustedQueryCase'].astype(str)\n",
+ "\n",
+ "text_features = tfidf.transform(TryLinearSVC)\n",
+ "\n",
+ "predictions = model.predict(text_features)\n",
+ "\n",
+ "\n",
+ "for queryTerm, predicted in zip(TryLinearSVC, predictions):\n",
+ " TryLinearSVCdf = TryLinearSVCdf.append(pd.DataFrame({'adjustedQueryCase': queryTerm, \n",
+ " 'pred-LinearSVC': id_to_category[predicted]}, index=[0]), ignore_index=True, sort=True)\n",
+ "\n",
+ "TryLinearSVCdf = TryLinearSVCdf[['adjustedQueryCase', 'pred-LinearSVC']]\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 11. TryLogisticRegressiondf - Unmatched terms with LogisticRegression\n",
+ "# Less than 1 minute\n",
+ "# ========================================================================\n",
+ "'''\n",
+ "https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a\n",
+ "'''\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "logisticRegr = LogisticRegression(random_state=0)\n",
+ "\n",
+ "logisticRegr.fit(X_train, y_train)\n",
+ "\n",
+ "predictions = logisticRegr.predict(X_train) # FIXME: training-set predictions; predict on the unassigned queries' features instead\n",
+ "\n",
+ "'''\n",
+ "# Use score method to get accuracy of model\n",
+ "score = logisticRegr.score(X_test, y_test)\n",
+ "print(score)\n",
+ "'''\n",
+ "\n",
+ "TryLogisticRegressionDf = pd.DataFrame()\n",
+ "TryLogisticRegressionDf['adjustedQueryCase'] = \"\"\n",
+ "TryLogisticRegressionDf['pred-LogisticReg'] = \"\"\n",
+ "\n",
+ "TryLogisticRegression = 
 unassignedAfterUmls1['adjustedQueryCase'].astype(str)\n",
+ "\n",
+ "text_features = tfidf.transform(TryLogisticRegression)\n",
+ "\n",
+ "# Predict on the unassigned queries' tf-idf features, so predictions\n",
+ "# lines up 1:1 with TryLogisticRegression in the zip below\n",
+ "predictions = logisticRegr.predict(text_features)\n",
+ "\n",
+ "for queryTerm, predicted in zip(TryLogisticRegression, predictions):\n",
+ " TryLogisticRegressionDf = TryLogisticRegressionDf.append(pd.DataFrame({'adjustedQueryCase': queryTerm, \n",
+ " 'pred-LogisticReg': id_to_category[predicted]}, index=[0]), ignore_index=True, sort=True)\n",
+ "\n",
+ "TryLogisticRegressionDf = TryLogisticRegressionDf[['adjustedQueryCase', 'pred-LogisticReg']]\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# JOIN NEW DATAFRAMES\n",
+ "\n",
+ "twoGuesses = pd.merge(TryLinearSVCdf, TryLogisticRegressionDf)\n",
+ "\n",
+ "writer = pd.ExcelWriter('01_Pre-processing_files/twoGuesses.xlsx')\n",
+ "twoGuesses.to_excel(writer,'twoGuesses')\n",
+ "# df2.to_excel(writer,'Sheet2')\n",
+ "writer.save()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 12. 
Final report by category\n", + "# Less than 1 minute\n", + "# ========================================================================\n", + "'''\n", + "Classes where more training data is needed, or other changes need to be\n", + "made.\n", + "'''\n", + " \n", + "from sklearn import metrics\n", + "\n", + "print(metrics.classification_report(y_test, y_pred, \n", + " target_names=df['SemanticTypeName'].unique()))\n", + "\n", + "\n", + "'''\n", + "\n", + " precision recall f1-score support\n", + "\n", + " NLM Product or Service 0.66 0.48 0.55 299\n", + " Quantitative Concept 0.31 0.22 0.26 78\n", + "Nucleic Acid, Nucleoside, or Nucleotide 0.40 0.14 0.21 57\n", + " Therapeutic or Preventive Procedure 0.66 0.47 0.55 383\n", + " Plant 0.53 0.12 0.19 178\n", + " Organic Chemical 0.26 0.94 0.41 1212\n", + " Intellectual Product 0.50 0.29 0.36 270\n", + " Amino Acid, Peptide, or Protein 0.68 0.24 0.35 409\n", + " Cell Component 0.65 0.36 0.46 36\n", + " Pharmacologic Substance 0.29 0.15 0.20 143\n", + " Indicator, Reagent, or Diagnostic Aid 0.20 0.08 0.12 12\n", + " Temporal Concept 0.36 0.20 0.26 45\n", + " Nucleotide Sequence 0.00 0.00 0.00 3\n", + " Laboratory Procedure 0.57 0.54 0.55 98\n", + " Body Part, Organ, or Organ Component 0.55 0.39 0.45 199\n", + " Finding 0.40 0.26 0.32 334\n", + " Disease or Syndrome 0.68 0.51 0.58 1040\n", + " Spatial Concept 0.20 0.02 0.04 43\n", + " Manufactured Object 0.28 0.11 0.16 91\n", + " Cell 0.66 0.61 0.64 44\n", + " Gene or Genome 0.95 0.54 0.69 448\n", + " Vitamin 0.00 0.00 0.00 4\n", + " Immunologic Factor 0.78 0.34 0.47 116\n", + " Cell or Molecular Dysfunction 0.40 0.14 0.21 14\n", + " Diagnostic Procedure 0.62 0.41 0.50 104\n", + " Molecular Function 0.47 0.24 0.32 29\n", + " semTypeName 0.77 0.66 0.71 176\n", + " Neoplastic Process 0.00 0.00 0.00 5\n", + " Self-help or Relief Organization 0.20 0.16 0.18 19\n", + " Body Location or Region 0.58 0.37 0.45 119\n", + " Sign or Symptom 0.00 0.00 0.00 31\n", + " Acquired 
Abnormality 0.59 0.28 0.38 169\n", + " Medical Device 0.21 0.18 0.19 28\n", + " Anatomical Abnormality 0.74 0.67 0.70 116\n", + " Injury or Poisoning 0.41 0.33 0.37 27\n", + " Clinical Attribute 0.53 0.44 0.48 125\n", + " Mental or Behavioral Dysfunction 0.26 0.15 0.19 131\n", + " Pathologic Function 0.54 0.24 0.33 59\n", + " Population Group 1.00 0.33 0.50 9\n", + " Embryonic Structure 0.40 0.17 0.24 12\n", + " Regulation or Law 0.18 0.08 0.11 98\n", + " Qualitative Concept 0.62 0.22 0.33 94\n", + " Congenital Abnormality 0.29 0.10 0.15 72\n", + " Functional Concept 0.00 0.00 0.00 35\n", + " Occupational Activity 0.47 0.24 0.32 67\n", + " Mental Process 0.29 0.19 0.23 62\n", + " Professional or Occupational Group 0.25 0.09 0.13 11\n", + " Organization 0.50 0.29 0.37 90\n", + " Food 0.00 0.00 0.00 74\n", + " Eukaryote 0.00 0.00 0.00 20\n", + " Phenomenon or Process 0.54 0.22 0.32 58\n", + " Health Care Related Organization 0.50 0.10 0.17 49\n", + " Organism Function 0.12 0.06 0.09 31\n", + " Organ or Tissue Function 0.36 0.15 0.21 34\n", + " Social Behavior 0.31 0.08 0.13 59\n", + " Idea or Concept 0.96 0.32 0.48 78\n", + " Bacterium 0.00 0.00 0.00 2\n", + " Chemical 0.56 0.17 0.26 29\n", + " Natural Phenomenon or Process 0.31 0.17 0.22 24\n", + " Activity 0.00 0.00 0.00 2\n", + " Professional Society 0.45 0.41 0.43 123\n", + " Health Care Activity 0.33 0.06 0.10 17\n", + " Element, Ion, or Isotope 0.17 0.05 0.07 21\n", + " Physiologic Function 0.38 0.25 0.30 32\n", + " Daily or Recreational Activity 0.65 0.47 0.54 129\n", + " Biomedical Occupation or Discipline 0.00 0.00 0.00 10\n", + " Chemical Viewed Structurally 0.73 0.40 0.52 20\n", + " Receptor 0.63 0.32 0.42 38\n", + " Virus 0.75 0.17 0.27 18\n", + " Tissue 0.00 0.00 0.00 21\n", + " Organism Attribute 0.00 0.00 0.00 7\n", + " Chemical Viewed Functionally 0.50 0.31 0.38 32\n", + " Individual Behavior 0.12 0.12 0.12 8\n", + " Age Group 0.27 0.50 0.35 8\n", + " Group Attribute 0.44 0.41 0.42 17\n", + " NLM 
Organizational Component 0.27 0.25 0.26 12\n", + " Cell Function 0.44 0.21 0.28 39\n", + " Occupation or Discipline 0.43 0.07 0.12 90\n", + " Geographic Area 0.56 0.74 0.64 31\n", + " Clinical Drug 0.40 0.10 0.16 20\n", + " Fungus 0.45 0.49 0.47 49\n", + " Research Activity 0.14 0.07 0.10 14\n", + " Substance 0.00 0.00 0.00 2\n", + " Environmental Effect of Humans 1.00 0.29 0.44 7\n", + " Patient or Disabled Group 0.00 0.00 0.00 4\n", + " Human 0.38 0.19 0.25 16\n", + " Laboratory or Test Result 0.00 0.00 0.00 1\n", + " Reptile 0.00 0.00 0.00 2\n", + " Experimental Model of Disease 0.17 0.05 0.08 20\n", + " Biologically Active Substance 0.00 0.00 0.00 16\n", + " Conceptual Entity 0.47 0.21 0.29 39\n", + " Biomedical or Dental Material 0.69 0.19 0.30 47\n", + " Mammal 0.00 0.00 0.00 1\n", + " Amino Acid Sequence 0.11 0.11 0.11 18\n", + " Body Substance 0.00 0.00 0.00 3\n", + " Amphibian 0.00 0.00 0.00 6\n", + " Biologic Function 1.00 0.08 0.15 12\n", + " Bird 0.00 0.00 0.00 0\n", + " Anatomical Structure 0.50 0.12 0.20 8\n", + " Hormone 0.00 0.00 0.00 16\n", + " Classification 0.00 0.00 0.00 6\n", + " Fish 0.00 0.00 0.00 3\n", + " Animal 0.00 0.00 0.00 1\n", + " Event 0.31 0.31 0.31 13\n", + " Body Space or Junction 0.00 0.00 0.00 3\n", + " Antibiotic 0.00 0.00 0.00 14\n", + " Governmental or Regulatory Activity 0.77 0.38 0.51 26\n", + " Educational Activity 0.00 0.00 0.00 2\n", + " Archaeon 0.00 0.00 0.00 13\n", + " Hazardous or Poisonous Substance 0.00 0.00 0.00 5\n", + " Machine Activity 0.40 0.32 0.35 19\n", + " Genetic Function 0.00 0.00 0.00 14\n", + " Inorganic Chemical 0.00 0.00 0.00 4\n", + " Behavior 0.75 0.30 0.43 10\n", + " Human-caused Phenomenon or Process 0.00 0.00 0.00 7\n", + " Molecular Biology Research Technique 0.44 0.31 0.36 13\n", + " Body System 0.40 0.44 0.42 9\n", + " Language 0.50 0.07 0.12 15\n", + " Organism 0.00 0.00 0.00 5\n", + " Group 0.29 0.17 0.21 12\n", + " Family Group 0.00 0.00 0.00 2\n", + " Research Device 0.00 0.00 0.00 1\n", 
+ " Physical Object 0.00 0.00 0.00 2\n", + " NLM Product 0.00 0.00 0.00 1\n", + "\n", + " avg / total 0.51 0.42 0.40 8878 \n", + "'''" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/05_Chart_the_trends.ipynb b/05_Chart_the_trends.ipynb new file mode 100644 index 0000000..dd0d3eb --- /dev/null +++ b/05_Chart_the_trends.ipynb @@ -0,0 +1,293 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 5. Chart the trends\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** Biggest Movers / Percent change charts
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "\n", + "## Script contents\n", + "\n", + "1. Start-up / What to put into place, where\n", + "2. Load and clean a subset of data\n", + "3. Put stats into form that matplotlib can consume and export data\n", + "4. Biggest movers bar chart - Percent change in search frequency\n", + "\n", + "\n", + "## FIXMEs\n", + "\n", + "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n", + "\n", + "* [ ] \n", + "\n", + "\n", + "## RESOURCES\n", + "\n", + "- Partly based on code from Mueller-Guido 2017, Visualize_coefficients, p 341.\n", + "- https://stackoverflow.com/questions/tagged/matplotlib\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. Start-up / What to put into place, where\n", + "# ============================================\n", + "\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import os\n", + "\n", + "from matplotlib.colors import ListedColormap\n", + "\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/Projects/webDS/_util')\n", + "\n", + "localDir = '05_Chart_the_trends_files/' # Different than others, see about changing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 2. 
Load and clean a subset of data\n",
+ "# ===================================\n",
+ "\n",
+ "logAfterFuzzyMatch = pd.read_excel('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n",
+ "\n",
+ "# Limit to off-LAN, NLM Home\n",
+ "df1 = logAfterFuzzyMatch.loc[logAfterFuzzyMatch['StaffYN'].str.contains('N') == True]\n",
+ "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n",
+ "df1 = df1[df1.Referrer.str.contains('|'.join(searchfor))]\n",
+ "\n",
+ "'''\n",
+ "# If you want to remove unparsed\n",
+ "df1 = df1[df1.SemanticGroup.str.contains(\"Unparsed\") == False]\n",
+ "df1 = df1[df1.preferredTerm.str.contains(\"PubMed strategy, citation, unclear, etc.\") == False]\n",
+ "'''\n",
+ "\n",
+ "\n",
+ "# reduce cols\n",
+ "df2 = df1[['Timestamp', 'preferredTerm', 'SemanticTypeName', 'SemanticGroup']]\n",
+ "\n",
+ "# Get nan count, remove nan rows\n",
+ "Unassigned = df2['preferredTerm'].isnull().sum()\n",
+ "df2 = df2[~pd.isnull(df2['Timestamp'])]\n",
+ "df2 = df2[~pd.isnull(df2['preferredTerm'])]\n",
+ "df2 = df2[~pd.isnull(df2['SemanticTypeName'])]\n",
+ "df2 = df2[~pd.isnull(df2['SemanticGroup'])]\n",
+ "\n",
+ "# Limit to May and June and assign month name\n",
+ "df2.loc[(df2['Timestamp'] > '2018-05-01 00:00:00') & (df2['Timestamp'] < '2018-06-01 00:00:00'), 'Month'] = 'May'\n",
+ "df2.loc[(df2['Timestamp'] > '2018-06-01 00:00:00') & (df2['Timestamp'] < '2018-07-01 00:00:00'), 'Month'] = 'June'\n",
+ "# Rows outside May-June have NaN in 'Month' (not \"\"), so drop NaN rather than filtering on != \"\"\n",
+ "df2 = df2.dropna(subset=['Month'])\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "--------------------------\n",
+ "IN CASE YOU COMPLETE CYCLE AND THEN SEE THAT LABELS SHOULD BE SHORTENED\n",
+ "\n",
+ "# Shorten names if needed\n",
+ "df2['preferredTerm'] = df2['preferredTerm'].str.replace('National Center for Biotechnology Information', 'NCBI')\n",
+ "df2['preferredTerm'] = df2['preferredTerm'].str.replace('Samples of Formatted Refs J Articles', 'Formatted Refs Authors J Articles')\n",
+ "df2['preferredTerm'] = df2['preferredTerm'].str.replace('Formatted 
References for Authors of Journal Articles', 'Formatted Refs J Articles')\n", + "\n", + "dobby = df2.loc[df2['preferredTerm'].str.contains('Formatted') == True]\n", + "dobby = df2.loc[df2['preferredTerm'].str.contains('Biotech') == True]\n", + "\n", + "writer = pd.ExcelWriter('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n", + "df2.to_excel(writer,'logAfterFuzzyMatch')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "'''\n", + "\n", + "writer = pd.ExcelWriter('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n", + "df2.to_excel(writer,'logAfterFuzzyMatch')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Count number of unique preferredTerm\n", + "\n", + "# May counts\n", + "May = df2.loc[df2['Month'].str.contains('May') == True]\n", + "MayCounts = May.groupby('preferredTerm').size()\n", + "MayCounts = pd.DataFrame({'MayCount':MayCounts})\n", + "# MayCounts = MayCounts.sort_values(by='timesSearched', ascending=False)\n", + "MayCounts = MayCounts.reset_index()\n", + "\n", + "# June counts\n", + "June = df2.loc[df2['Month'].str.contains('June') == True]\n", + "JuneCounts = June.groupby('preferredTerm').size()\n", + "JuneCounts = pd.DataFrame({'JuneCount':JuneCounts})\n", + "# JuneCounts = JuneCounts.sort_values(by='timesSearched', ascending=False)\n", + "JuneCounts = JuneCounts.reset_index()\n", + "\n", + "\n", + "# Remove rows with a count less than 10; next code would make some exponential.\n", + "MayCounts = MayCounts[MayCounts['MayCount'] >= 10]\n", + "JuneCounts = JuneCounts[JuneCounts['JuneCount'] >= 10]\n", + "\n", + "# Join, removing terms not searched in BOTH months \n", + "df3 = pd.merge(MayCounts, JuneCounts, how='inner', on='preferredTerm')\n", + "\n", + "# Assign the percentage of that month's search share\n", + "# MayPercent\n", + "df3['MayPercent'] = \"\"\n", + "MayTotal = df3.MayCount.sum()\n", + 
"df3['MayPercent'] = df3.MayCount / MayTotal * 100\n", + "\n", + "# JunePercent\n", + "df3['JunePercent'] = \"\"\n", + "JuneTotal = df3.JuneCount.sum()\n", + "df3['JunePercent'] = df3.JuneCount / JuneTotal * 100\n", + "\n", + "# Assign Percent Change\n", + "df3['PercentChange'] = \"\"\n", + "df3['PercentChange'] = df3.JunePercent - df3.MayPercent\n", + "\n", + "# Prep for next phase\n", + "\n", + "PercentChangeData = df3\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 3. Put stats into form that matplotlib can consume and export data\n", + "# ===================================================================\n", + "\n", + "PercentChangeData = PercentChangeData.sort_values(by='PercentChange', ascending=True)\n", + "PercentChangeData = PercentChangeData.reset_index()\n", + "PercentChangeData.drop(['index'], axis=1, inplace=True) \n", + " \n", + "negative_values = PercentChangeData.head(20)\n", + "\n", + "positive_values = PercentChangeData.tail(20)\n", + "positive_values = positive_values.sort_values(by='PercentChange', ascending=True)\n", + "positive_values = positive_values.reset_index()\n", + "positive_values.drop(['index'], axis=1, inplace=True) \n", + "\n", + "interesting_values = negative_values.append([positive_values])\n", + "\n", + "\n", + "# Write out full file and chart file\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'PercentChangeData.xlsx')\n", + "PercentChangeData.to_excel(writer,'PercentChangeData')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "writer = pd.ExcelWriter(localDir + 'interesting_values.xlsx')\n", + "interesting_values.to_excel(writer,'interesting_values')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 4. 
Biggest movers bar chart - Percent change in search frequency\n", + "# =================================================================\n", + "'''\n", + "Re-start:\n", + "interesting_values = pd.read_excel(localDir + 'interesting_values.xlsx')\n", + "'''\n", + "\n", + "\n", + "# Percent change chart\n", + "cm = ListedColormap(['#0000aa', '#ff2020'])\n", + "colors = [cm(1) if c < 0 else cm(0)\n", + " for c in interesting_values.PercentChange]\n", + "ax = interesting_values.plot(x='preferredTerm', y='PercentChange',\n", + " kind='bar', \n", + " color=colors,\n", + " fontsize=10) # figsize=(30, 10), \n", + "ax.set_xlabel(\"preferredTerm\")\n", + "ax.set_ylabel(\"Percent change for June\")\n", + "ax.legend_.remove()\n", + "plt.axvline(x=19.4, linewidth=.5, color='gray')\n", + "plt.axvline(x=19.6, linewidth=.5, color='gray')\n", + "plt.subplots_adjust(bottom=0.4)\n", + "plt.ylabel(\"Percent change in search frequency\")\n", + "plt.xlabel(\"Standardized topic name from UMLS+\")\n", + "plt.xticks(rotation=60, ha=\"right\", fontsize=9)\n", + "plt.suptitle('Biggest movers - How June site searches were different from the past', fontsize=16, fontweight='bold')\n", + "plt.title('NLM Home page, classify-able search terms only. 
In June use of the terms on the left\\ndropped the most, and use of the terms on the right rose the most, compared to May.', fontsize=10)\n", + "plt.show()\n", + "\n", + "# How June was different than May\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Outlier check\n", + "# =================================================================\n", + "'''\n", + "Why did Bibliographic Entity increase by 4%?\n", + "'''\n", + "\n", + "huh = logAfterFuzzyMatch[logAfterFuzzyMatch.preferredTerm.str.startswith(\"Biblio\") == True] # retrieve records to eyeball\n", + "# huh = huh.groupby('preferredTerm').size()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/05b_Chart_the_trends-BiggestMovers.ipynb b/05b_Chart_the_trends-BiggestMovers.ipynb new file mode 100644 index 0000000..41dca53 --- /dev/null +++ b/05b_Chart_the_trends-BiggestMovers.ipynb @@ -0,0 +1,578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 05b Chart the trends - \"Biggest movers\" May-June\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** May-June analysis, fuller than the 05 file. Biggest Movers / Percent change charts
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "\n", + "## Script contents\n", + "\n", + "1. Start-up / What to put into place, where\n", + "2. Unite search log data into single dataframe; globally update columns and rows\n", + "3. Separate out the queries with non-English characters\n", + "4. Run STAFF stats\n", + "5. Run PUBLIC (off-LAN) stats\n", + "6. Add result to MySQL, process at http://localhost:5000/searchsum\n", + "\n", + "\n", + "## FIXMEs\n", + "\n", + "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n", + "\n", + "* [ ] " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. Start-up / What to put into place, where\n", + "# ============================================\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import os\n", + "import string\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/Projects/webDS/_util')\n", + "\n", + "localDir = '05_Chart_the_trends_files/'\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 2. 
Unite search log data into single dataframe; globally update columns and rows\n", + "# =================================================================================\n", + "# What is your new log file named?\n", + "\n", + "newSearchLogFile = '00_Source_files/FY18-q3.xlsx'\n", + "\n", + "x1 = pd.read_excel(newSearchLogFile, 'Page1_1', skiprows=2)\n", + "x2 = pd.read_excel(newSearchLogFile, 'Page1_2', skiprows=2)\n", + "x3 = pd.read_excel(newSearchLogFile, 'Page1_3', skiprows=2)\n", + "x4 = pd.read_excel(newSearchLogFile, 'Page1_4', skiprows=2)\n", + "x5 = pd.read_excel(newSearchLogFile, 'Page1_5', skiprows=2)\n", + "x6 = pd.read_excel(newSearchLogFile, 'Page1_6', skiprows=2)\n", + "# x5 = pd.read_excel('00 SourceFiles/2018-06/Queries-2018-05.xlsx', 'Page1_2', skiprows=2)\n", + "\n", + "searchLog = pd.concat([x1, x2, x3, x4, x5, x6], ignore_index=True) # , x3, x4, x5, x6, x7\n", + "\n", + "searchLog.head(n=5)\n", + "searchLog.shape\n", + "searchLog.info()\n", + "searchLog.columns\n", + "\n", + "# Drop ID column, not needed\n", + "# searchLog.drop(['ID'], axis=1, inplace=True)\n", + " \n", + "# Until Cognos report is fixed, problem of blank columns, multi-word col name\n", + "# Update col name\n", + "searchLog = searchLog.rename(columns={'Search Timestamp': 'Timestamp', \n", + " 'NLM IP Y/N':'StaffYN',\n", + " 'IP':'SessionID'})\n", + "\n", + "# Remove https:// to become joinable with traffic data\n", + "searchLog['Referrer'] = searchLog['Referrer'].str.replace('https://', '')\n", + "\n", + "# Dupe off the Query column into a lower-cased 'adjustedQueryCase', which \n", + "# will be the column you match against\n", + "searchLog['adjustedQueryCase'] = searchLog['Query'].str.lower()\n", + "\n", + "# Remove incomplete rows, which can cause errors later\n", + "searchLog = searchLog[~pd.isnull(searchLog['Referrer'])]\n", + "searchLog = searchLog[~pd.isnull(searchLog['Query'])]\n", + "\n", + "# Limit to NLM Home\n", + "searchfor = ['www.nlm.nih.gov$', 
'www.nlm.nih.gov/$']\n",
+ "HmPgLog = searchLog[searchLog.Referrer.str.contains('|'.join(searchfor))]\n",
+ "\n",
+ "timeBoundHmPgLog = HmPgLog\n",
+ "\n",
+ "# Limit to May and June and assign month name\n",
+ "timeBoundHmPgLog.loc[(timeBoundHmPgLog['Timestamp'] > '2018-05-01 00:00:00') & (timeBoundHmPgLog['Timestamp'] < '2018-06-01 00:00:00'), 'Month'] = 'May'\n",
+ "timeBoundHmPgLog.loc[(timeBoundHmPgLog['Timestamp'] > '2018-06-01 00:00:00') & (timeBoundHmPgLog['Timestamp'] < '2018-07-01 00:00:00'), 'Month'] = 'June'\n",
+ "# Rows outside May-June have NaN in 'Month' (not \"\"), so drop NaN\n",
+ "timeBoundHmPgLog.dropna(subset=['Month'], inplace=True) \n",
+ "\n",
+ "\n",
+ "# Useful to write out the cleaned up version; if you do re-processing, you can skip a bunch of work.\n",
+ "writer = pd.ExcelWriter(localDir + 'timeBoundHmPgLog.xlsx')\n",
+ "timeBoundHmPgLog.to_excel(writer,'timeBoundHmPgLog')\n",
+ "# df2.to_excel(writer,'Sheet2')\n",
+ "writer.save()\n",
+ "\n",
+ "# Remove x1., etc., searchLog, HmPgLog\n"
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
+ "# 3. Separate out the queries with non-English characters\n",
+ "# ========================================================\n",
+ "'''\n",
+ "Rows whose Query is ASCII-only keep their existing preferredTerm value.\n",
+ "Approach drawn from:\n",
+ "https://stackoverflow.com/questions/36340627/removing-non-ascii-characters-and-replacing-with-spaces-from-pandas-data-frame\n",
+ "https://stackoverflow.com/questions/27084617/detect-strings-with-non-english-characters-in-python\n",
+ "https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii\n",
+ "https://stackoverflow.com/questions/16353729/how-do-i-use-pandas-apply-function-to-multiple-columns\n",
+ "And other places\n",
+ "\n",
+ "For testing\n",
+ "searchLogClean = pd.read_excel(localDir + 'searchLogClean.xlsx')\n",
+ "searchLogClean = searchLogClean.iloc[12000:13000]\n",
+ "searchLogClean['preferredTerm'] = searchLogClean['preferredTerm'].str.replace(None, '')\n",
+ "\n",
+ "Future: Break out languages better; assign language name, find translation API, etc.\n",
+ "\n",
+ "Re-start\n",
+ "MayJuneHmPg = pd.read_excel(localDir + 'searchLog-MayJune-HmPg.xlsx')\n",
+ "timeBoundHmPgLog = MayJuneHmPg\n",
+ "'''\n",
+ "\n",
+ "\n",
+ "# When it hangs... checkTrouble = searchLog.iloc[156422:156427]\n",
+ "\n",
+ "\n",
+ "timeBoundHmPgLog['preferredTerm'] = \"\"\n",
+ "\n",
+ "def foreignCharTest(row):\n",
+ "    try:\n",
+ "        row['Query'].encode('ascii')\n",
+ "        return row['preferredTerm']  # ASCII-only query; leave the row's value as-is\n",
+ "    except UnicodeEncodeError:\n",
+ "        return 'NON-ENGLISH CHARACTERS'\n",
+ "\n",
+ "timeBoundHmPgLog['preferredTerm'] = timeBoundHmPgLog.apply(foreignCharTest, axis=1)\n",
+ "\n",
+ "# Optional: convert empty strings to NaN\n",
+ "# searchLog['preferredTerm'].replace('', np.nan, inplace=True)\n",
+ "\n"
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
+ "# 4. 
Run STAFF stats\n",
+ "# ==============================\n",
+ "'''\n",
+ "On-LAN stats\n",
+ "FIXME - Check whether Cognos separation of Staff-YN can exclude reading room?\n",
+ "But, how many of the people in the reading room are on www.nlm.nih.gov at all?\n",
+ "'''\n",
+ "# Restrict to staff\n",
+ "staffStats = timeBoundHmPgLog.loc[timeBoundHmPgLog['StaffYN'].str.contains('Y') == True]\n",
+ "\n",
+ "# Staff search count\n",
+ "totSearchesStaff = staffStats.groupby('Month')['ID'].nunique()\n",
+ "print(\"\\nTotal STAFF SEARCHES in raw log file:\\n{}\".format(totSearchesStaff))\n",
+ "\n",
+ "# Staff unique queries\n",
+ "uniqueSearchesStaff = staffStats['Query'].nunique()\n",
+ "uniqueSearchesStaff\n",
+ "\n",
+ "uniqueSearchesStaffByMonth = staffStats.groupby('Month')['Query'].nunique()\n",
+ "uniqueSearchesStaffByMonth\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "# Staff session count\n",
+ "totSessionsStaff = staffStats.groupby('Month')['SessionID'].nunique()\n",
+ "print(\"\\nTotal STAFF SESSIONS in raw log file:\\n{}\".format(totSessionsStaff))\n",
+ "\n",
+ "'''\n",
+ "Bar chart - by number of searches per session\n",
+ "\n",
+ "Average searches per session\n",
+ "Median searches per session\n",
+ "Average searches per day (@ 22d/mo.)\n",
+ "Median searches per day (@ 22d/mo.)\n",
+ "Average sessions per day\n",
+ "Median sessions per day\n",
+ "Highest search count in one session\n",
+ "\n",
+ "'''\n",
+ "\n",
+ "\n",
+ "# Top queries from NLM LAN, from NLM Home (not normalized)\n",
+ "searchLogLanYesHmPg = staffStats.loc[staffStats['StaffYN'].str.contains('Y') == True]\n",
+ "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n",
+ "searchLogLanYesHmPg = searchLogLanYesHmPg[searchLogLanYesHmPg.Referrer.str.contains('|'.join(searchfor))]\n",
+ "searchLogLanYesHmPgQueryCounts = searchLogLanYesHmPg['Query'].value_counts()\n",
+ "searchLogLanYesHmPgQueryCounts = searchLogLanYesHmPgQueryCounts.reset_index()\n",
+ "searchLogLanYesHmPgQueryCounts = 
searchLogLanYesHmPgQueryCounts.rename(columns={'index': 'Top queries from NLM LAN, from Home, as entered', 'Query': 'Count'})\n",
+ "searchLogLanYesHmPgQueryCounts.head(n=25)\n",
+ "\n"
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
+ "# 5. Run PUBLIC (off-LAN) stats\n",
+ "# ==============================\n",
+ "\n",
+ "\n",
+ "visitorStats = timeBoundHmPgLog.loc[timeBoundHmPgLog['StaffYN'].str.contains('N') == True]\n",
+ "\n",
+ "# Count rows with foreign chars\n",
+ "foreignCount = visitorStats.loc[visitorStats['preferredTerm'].str.contains('NON-ENGLISH CHARACTERS') == True]\n",
+ "foreignCount.count()\n",
+ "\n",
+ "# Drop rows with foreign chars\n",
+ "visitorStats = visitorStats[visitorStats.preferredTerm != 'NON-ENGLISH CHARACTERS']\n",
+ "\n",
+ "# Visitor search count\n",
+ "totSearchesVisitors = visitorStats.groupby('Month')['ID'].nunique()\n",
+ "print(\"\\nTotal VISITOR SEARCHES in raw log file:\\n{}\".format(totSearchesVisitors))\n",
+ "\n",
+ "# Visitor unique queries\n",
+ "uniqueSearchesVisitors = visitorStats['Query'].nunique()\n",
+ "uniqueSearchesVisitors\n",
+ "\n",
+ "uniqueSearchesVisitorsByMonth = visitorStats.groupby('Month')['Query'].nunique()\n",
+ "uniqueSearchesVisitorsByMonth\n",
+ "\n",
+ "\n",
+ "# Visitor session count\n",
+ "totSessionsVisitors = visitorStats.groupby('Month')['SessionID'].nunique()\n",
+ "print(\"\\nTotal VISITOR SESSIONS in raw log file:\\n{}\".format(totSessionsVisitors))\n",
+ "\n",
+ "\n",
+ "\n",
+ "'''\n",
+ "Bar chart - by number of searches per session\n",
+ "\n",
+ "Average searches per session\n",
+ "Median searches per session\n",
+ "Average searches per day (@ 22d/mo.)\n",
+ "Median searches per day (@ 22d/mo.)\n",
+ "Average sessions per day\n",
+ "Median sessions per day\n",
+ "Highest search count in one session\n",
+ "\n",
+ "'''\n",
+ "\n",
+ "\n",
+ "# Highest session search count\n",
+ "SessionCounts = visitorStats['SessionID'].value_counts()\n",
+ 
"SessionCounts = pd.DataFrame({'TypeCount':SessionCounts})\n", + "SessionCounts.sort_values(\"TypeCount\", ascending=True, inplace=True)\n", + "SessionCounts = SessionCounts.reset_index()\n", + "\n", + "# test = searchLog.loc[searchLog['SessionID'].str.contains('47C9DEE89B48E22FB53E2BE2DB107763') == True]\n", + "\n", + "\n", + "# Top queries outside NLM LAN, from NLM Home (not normalized)\n", + "# May-June\n", + "df3LanNoHmPgQueryCounts = visitorStats['Query'].value_counts()\n", + "df3LanNoHmPgQueryCounts = df3LanNoHmPgQueryCounts.reset_index()\n", + "df3LanNoHmPgQueryCounts = df3LanNoHmPgQueryCounts.rename(columns={'index': 'Top queries off of LAN, from Home, as entered', 'Query': 'Count'})\n", + "df3LanNoHmPgQueryCounts.head(n=25)\n", + "\n", + "# May top 25\n", + "MayVisitorTop25 = visitorStats.loc[visitorStats['Month'].str.contains('May') == True]\n", + "MayVisitorTop25 = MayVisitorTop25['Query'].value_counts()\n", + "MayVisitorTop25 = MayVisitorTop25.reset_index()\n", + "MayVisitorTop25 = MayVisitorTop25.rename(columns={'index': 'Top VISITOR queries from NLM Home page, as entered', 'Query': 'Count'})\n", + "MayVisitorTop25.head(n=25)\n", + "\n", + "# June top 25\n", + "JuneVisitorTop25 = visitorStats.loc[visitorStats['Month'].str.contains('June') == True]\n", + "JuneVisitorTop25 = JuneVisitorTop25['Query'].value_counts()\n", + "JuneVisitorTop25 = JuneVisitorTop25.reset_index()\n", + "JuneVisitorTop25 = JuneVisitorTop25.rename(columns={'index': 'Top VISITOR queries from NLM Home page, as entered', 'Query': 'Count'})\n", + "JuneVisitorTop25.head(n=25)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# logAfterFuzzyMatch\n", + "\n", + "EffectOfLight = logAfterFuzzyMatch.loc[logAfterFuzzyMatch['Query'].str.contains('effect of light') == True]\n", + "\n", + "# Useful to write out the cleaned up version; if you do re-processing, you can skip a bunch of work.\n", + "writer = pd.ExcelWriter(localDir 
+ 'EffectOfLight.xlsx')\n", + "EffectOfLight.to_excel(writer,'EffectOfLight')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "\n", + "dobby = logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.startswith('Samples of Formatted') == True]\n", + "\n", + "# Samples of Formatted References for Authors of Journal Articles\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 6. Add result to MySQL, process at http://localhost:5000/searchsum\n", + "# ========================================================================\n", + "'''\n", + "timeBoundHmPgLog.columns\n", + "\n", + "In phpMyAdmin:\n", + "\n", + "DROP TABLE IF EXISTS `timeboundhmpglog`;\n", + "CREATE TABLE `timeboundhmpglog` (\n", + " `Timestamp` datetime DEFAULT NULL,\n", + " `preferredTerm` text,\n", + " `SemanticTypeName` text,\n", + " `SemanticTypeCode` int(11) DEFAULT NULL,\n", + " `SemanticGroup` text,\n", + " `SemanticGroupCode` int(11) DEFAULT NULL,\n", + " `Month` text\n", + ") ENGINE=InnoDB DEFAULT CHARSET=utf8;\n", + "\n", + " \n", + " \n", + "writer = pd.ExcelWriter(localDir + 'timeBoundHmPgLog.xlsx')\n", + "timeBoundHmPgLog.to_excel(writer,'timeBoundHmPgLog')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "Re-start\n", + "MayJuneHmPg = pd.read_excel(localDir + 'searchLog-MayJune-HmPg.xlsx')\n", + "timeBoundHmPgLog = MayJuneHmPg\n", + "'''\n", + "\n", + "logAfterFuzzyMatch = pd.read_excel('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n", + "\n", + "# Remove nans from Month\n", + "logAfterFuzzyMatch = logAfterFuzzyMatch.dropna(subset=['Month'])\n", + "\n", + "logAfterFuzzyMatch.columns\n", + "\n", + "# Reduce size for test\n", + "test = logAfterFuzzyMatch.iloc[0:49]\n", + "\n", + "\n", + "# Add dataframe to MySQL\n", + "\n", + "import mysql.connector\n", + "from pandas.io import sql\n", + "from sqlalchemy import create_engine\n", + "\n", + 
"dbconn = create_engine('mysql+mysqlconnector://wendlingd:DataSciPwr17@localhost/ia')\n", + "\n", + "logAfterFuzzyMatch.to_sql(name='timeboundhmpglog', con=dbconn, if_exists = 'replace', index=False) # or if_exists='append'\n", + " \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "'''\n", + "\n", + "test = df3\n", + "\n", + "df3.set_index('SessionID', inplace=True)\n", + "test = df3.groupby(['col2','col3'], as_index=False).count()\n", + "\n", + "\n", + "\n", + "test = df3.groupby(['Month','StaffYN'], as_index=False)['SessionID'].count()\n", + "test\n", + "\n", + "test = df3.groupby(['Month','StaffYN'], as_index=False)['SearchID'].count()\n", + "test\n", + "\n", + "\n", + "test = df3['ID'].groupby([df3['Month'], df3['StaffYN']]).size()\n", + "test\n", + "\n", + "\n", + "df3['SearchID'].count()\n", + "\n", + "test = df3.groupby(['Month', 'StaffYN'])['Referrer'].size()\n", + "test\n", + "\n", + "totSearches = df3.groupby(['Month', 'StaffYN'])['SearchID'].count()\n", + "print(\"\\nTotal SEARCHES in raw log file:\\n{}\".format(totSearches))\n", + "\n", + "totSessions = df3.groupby(['Month', 'StaffYN']).size()\n", + "print(\"\\nTotal SESSIONS in raw log file:\\n{}\".format(totSessions))\n", + "\n", + "\n", + "# pd.crosstab(df3.ID, df3.SessionID, margins=True)\n", + "\n", + "\n", + "# df3 = df3.rename(columns={'ID': 'SearchID'})\n", + "'''\n", + "\n", + "\n", + "\n", + "# Total SEARCHES in raw log file\n", + "totSearches = df3['SearchID'].groupby([df3['Month'], df3['StaffYN']]).count()\n", + "print(\"\\nTotal SEARCHES in raw log file:\\n{}\".format(totSearches))\n", + "\n", + "# Total SESSIONS in raw log file\n", + "totSessions = df3['SessionID'].groupby([df3['Month'], df3['StaffYN']]).count()\n", + "print(\"\\nTotal SESSIONS in raw log file:\\n{}\".format(totSessions))\n", + "\n", + "\n", + "\n", + "print(\"Total searches in raw log file: {}\".format(len(df3)))\n", + "\n", + "# totals\n", + 
"print(\"\\nTotal SEARCH QUERIES, on NLM LAN or not\\n{}\".format(df3['StaffYN'].value_counts()))\n", + "\n", + "print(\"\\nTotal SESSIONS, on NLM LAN or not\\n{}\".format(df3.groupby('StaffYN')['SessionID'].nunique()))\n", + "\n", + "\n", + "\n", + "\n", + "test = df3['SearchID'].groupby(df3['Month'])\n", + "test.count()\n", + "\n", + "# If you see digits in text col, perhaps these are partial log entries - eyeball for removal\n", + "# df3.drop(76080, inplace=True)\n", + "\n", + "\n", + "test = df3['StaffYN'].groupby(df3['Month'])\n", + "test.count()\n", + "\n", + "\n", + "\n", + "# Total SEARCHES containing 'Non-English characters'\n", + "print(\"Total SEARCHES with non-English characters\\n{}\".format(df3['preferredTerm'].value_counts()))\n", + "\n", + "# Total SESSIONS containing 'Non-English characters'\n", + "# Future\n", + "\n", + "\n", + "\n", + "\n", + "# How to set a date range\n", + "AprMay = logAfterUmlsApi1[(logAfterUmlsApi1['Timestamp'] > '2018-04-01 01:00:00') & (logAfterUmlsApi1['Timestamp'] < '2018-06-01 00:00:00')]\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "# Top queries from LAN (not normalized)\n", + "df3LanYes = df3.loc[df3['StaffYN'].str.contains('Y') == True]\n", + "df3LanYesQueryCounts = df3LanYes['Query'].value_counts()\n", + "df3LanYesQueryCounts = df3LanYesQueryCounts.reset_index()\n", + "df3LanYesQueryCounts = df3LanYesQueryCounts.rename(columns={'index': 'Top staff queries as entered', 'Query': 'Count'})\n", + "df3LanYesQueryCounts.head(n=30)\n", + "\n", + "# Top queries from NLM LAN, from NLM Home (not normalized)\n", + "df3LanYesHmPg = df3.loc[df3['StaffYN'].str.contains('Y') == True]\n", + "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n", + "df3LanYesHmPg = df3LanYesHmPg[df3LanYesHmPg.Referrer.str.contains('|'.join(searchfor))]\n", + "df3LanYesHmPgQueryCounts = df3LanYesHmPg['Query'].value_counts()\n", + "df3LanYesHmPgQueryCounts = df3LanYesHmPgQueryCounts.reset_index()\n", + "df3LanYesHmPgQueryCounts = 
df3LanYesHmPgQueryCounts.rename(columns={'index': 'Top queries from NLM LAN, from Home, as entered', 'Query': 'Count'})\n", + "df3LanYesHmPgQueryCounts.head(n=25)\n", + "\n", + "\n", + "# Top queries outside NLM LAN (not normalized)\n", + "df3LanNo = df3.loc[df3['StaffYN'].str.contains('N') == True]\n", + "df3LanNoQueryCounts = df3LanNo['Query'].value_counts()\n", + "df3LanNoQueryCounts = df3LanNoQueryCounts.reset_index()\n", + "df3LanNoQueryCounts = df3LanNoQueryCounts.rename(columns={'index': 'Top queries off of LAN, as entered', 'Query': 'Count'})\n", + "df3LanNoQueryCounts.head(n=25)\n", + "\n", + "\n", + "\n", + "# Top home page queries, staff or public\n", + "searchfor = ['www.nlm.nih.gov$', 'www.nlm.nih.gov/$']\n", + "df3AllHmPgQueryCounts = df3[df3.Referrer.str.contains('|'.join(searchfor))]\n", + "df3AllHmPgQueryCounts = df3AllHmPgQueryCounts['Query'].value_counts()\n", + "df3AllHmPgQueryCounts = df3AllHmPgQueryCounts.reset_index()\n", + "df3AllHmPgQueryCounts = df3AllHmPgQueryCounts.rename(columns={'index': 'Top home page queries, staff or public, as entered', 'Query': 'Count'})\n", + "df3AllHmPgQueryCounts.head(n=25)\n", + "\n", + "\n", + "# FIXME - Add table, Percentage of staff, public searches done within pages, within search results\n", + "\n", + "\n", + "# FIXME - Add table for Top queries with columns/counts On LAN, Off LAN, Total\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "# Remove the searches run from within search results screens, vsearch.nlm.nih.gov/vivisimo/\n", + "# I'm not looking at these now; you might be.\n", + "df3 = df3[df3.Referrer.str.startswith(\"www.nlm.nih.gov\") == True]\n", + "\n", + "# Not sure what these are, www.nlm.nih.gov/?_ga=2.95055260.1623044406.1513044719-1901803437.1513044719\n", + "df3 = df3[df3.Referrer.str.startswith(\"www.nlm.nih.gov/?_ga=\") == False]\n", + "\n", + "\n", + "# FIXME - VARIABLE EXPLORER: After saving the stats, remove unneeded 'Type=DataFrame' items\n", + "'''\n", + "Remove manually for 
now.\n", + "Not finding an equiv to R's rm; cf https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram?rq=1\n", + "pd.x1(), pd.x2(), # pd.x3(), pd.x4(), pd.x5(), pd.x6(), pd.x7(), \n", + " pd.searchLogLanYes(), pd.searchLogLanYesHmPg(), \n", + " pd.searchLogLanNo(), pd.searchLogLanNoHmPg(),\n", + " pd.searchLogAllHmPg()\n", + "'''\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/06_Load_database.ipynb b/06_Load_database.ipynb new file mode 100644 index 0000000..9e7506c --- /dev/null +++ b/06_Load_database.ipynb @@ -0,0 +1,316 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 6. Load database\n", + "App to analyze web-site search logs (internal search)
\n",
+ "**This script:** Create and load the log table in the database<br/>
\n",
+ "Author: dan.wendling@nih.gov<br/>
\n", + "Last modified: 2018-09-09\n", + "\n", + "For now let's load search_log and semantic_network. I decided not to load other tables for now; let's see how the work goes. The next candidate would be 01_Text_wrangling_files/GoldStandard_master.xlsx\n", + "\n", + "Preference: Postgres. If MySQL, the 03_Fuzzy_match file has code for SQLAlchemy, MySQLConnector, etc.\n", + "\n", + "\n", + "# Contents\n", + "1. search_log table\n", + "2. manual_assignments table\n", + "3. semantic_network table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 1. search_log table" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nsql code; this has worked with past code.\\n\\nDROP TABLE IF EXISTS `search_log`;\\nCREATE TABLE `search_log` (\\n `search_log_id` INT PRIMARY KEY NOT NULL AUTO_INCREMENT,\\n `Timestamp` datetime DEFAULT NULL,\\n `Query` varchar(800) DEFAULT NULL,\\n `Address` varchar(900) DEFAULT NULL,\\n `SessionID` varchar(15) NOT NULL,\\n `preferredTerm` text,\\n `SemanticTypeName` text,\\n `SemanticTypeCode` int(11) DEFAULT NULL,\\n `SemanticGroup` text,\\n `SemanticGroupCode` int(11) DEFAULT NULL,\\n `Month` text\\n) ENGINE=InnoDB DEFAULT CHARSET=utf8;\\n\\n# For a quick start with data, please try 06_Load_database/logAfterGoldStandard.xlsx, \\nwhich was copied over from 01_Text_wrangling_files.\\n'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'''\n", + "sql code; this has worked with past code.\n", + "\n", + "DROP TABLE IF EXISTS `search_log`;\n", + "CREATE TABLE `search_log` (\n", + " `search_log_id` INT PRIMARY KEY NOT NULL AUTO_INCREMENT,\n", + " `Timestamp` datetime DEFAULT NULL,\n", + " `Query` varchar(800) DEFAULT NULL,\n", + " `Address` varchar(900) DEFAULT NULL,\n", + " `SessionID` varchar(15) NOT NULL,\n", + " `preferredTerm` text,\n", + " `SemanticTypeName` text,\n", + " 
`SemanticTypeCode` int(11) DEFAULT NULL,\n", + " `SemanticGroup` text,\n", + " `SemanticGroupCode` int(11) DEFAULT NULL,\n", + " `Month` text\n", + ") ENGINE=InnoDB DEFAULT CHARSET=utf8;\n", + "\n", + "# For a quick start with data, please try 06_Load_database/logAfterGoldStandard.xlsx, \n", + "which was copied over from 01_Text_wrangling_files.\n", + "'''" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 2. manual_assignments table\n", + "\n", + "See 03_Fuzzy_match for how this is constructed/used.\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "''' \n", + "DROP TABLE IF EXISTS manual_assignments;\n", + "CREATE TABLE `manual_assignments` (\n", + " `assignment_id` INT PRIMARY KEY NOT NULL AUTO_INCREMENT,\n", + " `adjustedQueryCase` varchar(200) NULL,\n", + " `NewSemanticTypeName` varchar(100) NULL,\n", + " `preferredTerm` varchar(200) NULL,\n", + " `FuzzyToken` varchar(50) NULL,\n", + " `SemanticTypeName` varchar(100) NULL,\n", + " `SemanticGroup` varchar(50) NULL,\n", + " `timesSearched` int(11) NULL,\n", + " `FuzzyScore` int(11) NULL\n", + ") ENGINE=InnoDB DEFAULT CHARSET=utf8;\n", + "'''" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 3. semantic_network table\n", + "I didn't see what I needed online, so created it here. Should be sufficient for joining with the processed logs for reporting.\n", + "\n", + "Table has one UMLS Semantic Type per row.\n", + "\n", + "* SemanticGroupCode: With SemanticGroup, SemanticGroupAbr, identifies ~15 supergroups, see McCray AT, Burgun A, Bodenreider O. (2001). Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform. 84(Pt 1):216-20. PMID: 11604736. 
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4300099/ and https://semanticnetwork.nlm.nih.gov/.\n", + "* SemanticGroup: See SemanticGroupCode.\n", + "* SemanticGroupAbr: See SemanticGroupCode.\n", + "* CustomTreeNumber: An attempt to get queries to dump in the correct order so counts could be attached to each semantic type, with proper indentation.\n", + "* SemanticTypeName: See https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html\n", + "* BranchPosition: Use to create indents in browser-based reporting to make it look like https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html (but one column).\n", + "* Definition: Sem type Definition from UMLS documentation.\n", + "* Examples: Sem type examples from UMLS doc, with a few added.\n", + "* RelationName: Semantic \"triples\" from UMLS doc; not currently used.\n", + "* SemTypeTreeNo: Sem type tree number from UMLS doc; not currently used.\n", + "* UsageNote: From UMLS doc.\n", + "* Abbreviation: Sem type abbrev from UMLS doc; not currently used.\n", + "* UniqueID: Another attempt to sort the table in hierarchical order.\n", + "* NonHumanFlag: From UMLS doc; not currently used.\n", + "* RecordType: From UMLS doc; not currently used. The UMLS Semantic Network includes other content that could be added such as if we wanted to do {item} {howrelated} {item}. \n", + "\n", + "This information can go out of date over time. It is current as of Summer 2018." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'''\n", + "# sql code\n", + "CREATE TABLE `semantic_network` (\n", + " `SemanticGroupCode` bigint(20) DEFAULT NULL,\n", + " `SemanticGroup` text,\n", + " `SemanticGroupAbr` text,\n", + " `CustomTreeNumber` bigint(20) DEFAULT NULL,\n", + " `SemanticTypeName` text,\n", + " `BranchPosition` bigint(20) DEFAULT NULL,\n", + " `Definition` text,\n", + " `Examples` text,\n", + " `RelationName` text,\n", + " `SemTypeTreeNo` text,\n", + " `UsageNote` text,\n", + " `Abbreviation` text,\n", + " `UniqueID` bigint(20) DEFAULT NULL,\n", + " `NonHumanFlag` text,\n", + " `RecordType` text\n", + ") ENGINE=InnoDB DEFAULT CHARSET=utf8;\n", + "\n", + "--\n", + "-- Dumping data for table `semantic_network`\n", + "--\n", + "\n", + "INSERT INTO `semantic_network` (`SemanticGroupCode`, `SemanticGroup`, `SemanticGroupAbr`, `CustomTreeNumber`, `SemanticTypeName`, `BranchPosition`, `Definition`, `Examples`, `RelationName`, `SemTypeTreeNo`, `UsageNote`, `Abbreviation`, `UniqueID`, `NonHumanFlag`, `RecordType`) VALUES\n", + "(1, 'Activities and Behaviors', 'ACTI', 2, 'Event', 1, 'A broad type for grouping activities, processes and states.', 'Anniversaries; Exposure to Mumps virus (event); Device Unattended', '{inverse_isa} Activity; {inverse_isa} Phenomenon or Process', 'B', 'Few concepts will be assigned to this broad type.', 'evnt', 1051, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 21, 'Activity', 2, 'An operation or series of operations that an organism or machine carries out or participates in.', 'Expeditions; Information Distribution; Social Planning', '{isa} Event; {inverse_isa} Behavior; {inverse_isa} Daily or Recreational Activity; {inverse_isa} Occupational Activity; {inverse_isa} Machine Activity', 'B1', 'Few concepts will be assigned to this broad type. Wherever possible, one of the more specific types from this hierarchy will be chosen. 
For concepts assigned to this type, the focus of interest is on the activity. When the focus of interest is the individual or group that is carrying out the activity, then a type from the \\'Behavior\\' hierarchy will be chosen. In general, concepts will not receive a type from both the \\'Activity\\' and the \\'Behavior\\' hierarchies.', 'acty', 1052, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 211, 'Behavior', 3, 'Any of the psycho-social activities of humans or animals that can be observed directly by others or can be made systematically observable by the use of special strategies.', 'Homing Behavior; Sexuality; Habitat Selection', '{isa} Activity; {inverse_isa} Social Behavior; {inverse_isa} Individual Behavior', 'B1.1', 'Few concepts will be assigned to this broad type. For concepts assigned to the \\'Behavior\\' hierarchy, the focus of interest is on the individual or group that is carrying out the activity. When the activity is of paramount interest, then a type from the \\'Activity\\' hierarchy will be chosen. 
In general, concepts will not receive a type from both the \\'Behavior\\' and the \\'Activity\\' hierarchies.', 'bhvr', 1053, 'Y', 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 212, 'Daily or Recreational Activity', 3, 'An activity carried out for recreation or exercise, or as part of daily life.', 'Badminton; Dancing; Swimming', '{isa} Activity', 'B1.2', NULL, 'dora', 1056, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 213, 'Occupational Activity', 3, 'An activity carried out as part of an occupation or job.', 'Collective Bargaining; Commerce; Containment of Biohazards', '{isa} Activity; {inverse_isa} Health Care Activity; {inverse_isa} Research Activity; {inverse_isa} Governmental or Regulatory Activity; {inverse_isa} Educational Activity', 'B1.3', NULL, 'ocac', 1057, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 214, 'Machine Activity', 3, 'An activity carried out primarily or exclusively by machines.', 'Computer Simulation; Equipment Failure; Natural Language Processing', '{isa} Activity', 'B1.4', NULL, 'mcha', 1066, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 2111, 'Social Behavior', 4, 'Behavior that is a direct result or function of the interaction of humans or animals with their fellows. 
This includes behavior that may be considered anti-social.', 'Acculturation; Communication; Interpersonal Relations', '{isa} Behavior', 'B1.1.1', '\\'Social Behavior\\' requires the direct participation of others and is, thus, distinguished from \\'Individual Behavior\\' which is carried out by an individual, though others may be present.', 'socb', 1054, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 2112, 'Individual Behavior', 4, 'Behavior exhibited by a human or an animal that is not a direct result of interaction with other members of the species, but which may have an effect on others.', 'Assertiveness; Grooming; Risk-Taking', '{isa} Behavior', 'B1.1.2', '\\'Individual Behavior\\' is carried out by an individual, though others may be present, and is, thus, distinguished from \\'Social Behavior\\' which requires the direct participation of others.', 'inbe', 1055, NULL, 'STY'),\n", + "(1, 'Activities and Behaviors', 'ACTI', 2133, 'Governmental or Regulatory Activity', 4, 'An activity carried out by officially constituted governments, or an activity related to the creation or enforcement of the rules or regulations governing some field of endeavor.', 'Certification; Credentialing; Public Policy', '{isa} Occupational Activity', 'B1.3.3', NULL, 'gora', 1064, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 112, 'Anatomical Structure', 3, 'A normal or pathological part of the anatomy or structural organization of an organism.', 'Cadaver; Pharyngostome; Anatomic structures', '{isa} Physical Object; {inverse_isa} Embryonic Structure; {inverse_isa} Fully Formed Anatomical Structure; {inverse_isa} Anatomical Abnormality', 'A1.2', 'Few concepts will be assigned to this broad type.', 'anst', 1017, 'Y', 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 1121, 'Embryonic Structure', 4, 'An anatomical structure that exists only before the organism is fully formed; in mammals, for example, a structure that exists only prior to the birth of the organism. 
This structure may be normal or abnormal.', 'Blastoderm; Fetus; Neural Crest', '{isa} Anatomical Structure', 'A1.2.1', NULL, 'emst', 1018, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 1123, 'Fully Formed Anatomical Structure', 4, 'An anatomical structure in a fully formed organism; in mammals, for example, a structure in the body after the birth of the organism.', 'Entire body as a whole; Female human body; Set of parts of human body', '{isa} Anatomical Structure; {inverse_isa} Body Part, Organ, or Organ Component; {inverse_isa} Tissue; {inverse_isa} Cell; {inverse_isa} Cell Component; {inverse_isa} Gene or Genome', 'A1.2.3', 'Few concepts will be assigned to this broad type.', 'ffas', 1021, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 1142, 'Body Substance', 4, 'Extracellular material, or mixtures of cells and extracellular material, produced, excreted, or accreted by the body. Included here are substances such as saliva, dental enamel, sweat, and gastric acid.', 'Amniotic Fluid; saliva; Smegma', '{isa} Substance', 'A1.4.2', NULL, 'bdsu', 1031, 'Y', 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 11231, 'Body Part, Organ, or Organ Component', 5, 'A collection of cells and tissues which are localized to a specific area or combine and carry out one or more specialized functions of an organism. This ranges from gross structures to small components of complex organs. These structures are relatively localized in comparison to tissues.', 'Aorta; Brain Stem; Structure of neck of femur', '{isa} Fully Formed Anatomical Structure', 'A1.2.3.1', 'When assigning this type, consider whether \\'Body Location or Region\\' might be the correct choice.', 'bpoc', 1023, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 11232, 'Tissue', 5, 'An aggregation of similarly specialized cells and the associated intercellular substance. 
Tissues are relatively non-localized in comparison to body parts, organs or organ components.', 'Cartilage; Endothelium; Epidermis', '{isa} Fully Formed Anatomical Structure', 'A1.2.3.2', NULL, 'tisu', 1024, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 11233, 'Cell', 5, 'The fundamental structural and functional unit of living organisms.', 'B-Lymphocytes; Dendritic Cells; Fibroblasts', '{isa} Fully Formed Anatomical Structure', 'A1.2.3.3', NULL, 'cell', 1025, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 11234, 'Cell Component', 5, 'A part of a cell or the intercellular matrix, generally visible by light microscopy.', 'Axon; Golgi Apparatus; Organelles', '{isa} Fully Formed Anatomical Structure', 'A1.2.3.4', NULL, 'celc', 1026, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 12141, 'Body System', 5, 'A complex of anatomical structures that performs a common function.', 'Endocrine system; Renin-angiotensin system; Reticuloendothelial System', '{isa} Functional Concept', 'A2.1.4.1', NULL, 'bdsy', 1022, NULL, 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 12151, 'Body Space or Junction', 5, 'An area enclosed or surrounded by body parts or organs or the place where two anatomical structures meet or connect.', 'Knee joint; Greater sac of peritoneum; Synapses', '{isa} Spatial Concept', 'A2.1.5.1', NULL, 'bsoj', 1030, 'Y', 'STY'),\n", + "(2, 'Anatomy ', 'ANAT', 12152, 'Body Location or Region', 5, 'An area, subdivision, or region of the body demarcated for the purpose of topographical description.', 'Forehead; Sublingual Region; Base of skull structure', '{isa} Spatial Concept', 'A2.1.5.2', 'When assigning this type, consider whether \\'Body Part, Organ, or Organ Component\\' might be the correct choice.', 'blor', 1029, 'Y', 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1133, 'Clinical Drug', 4, 'A pharmaceutical preparation as produced by the manufacturer. 
The name usually includes the substance, its strength, and the form, but may include the substance and only one of the other two items.', 'Ranitidine 300 MG Oral Tablet [Zantac]; Aspirin 300 MG Delayed Release Oral Tablet; sleeping pill', '{isa} Manufactured Object', 'A1.3.3', 'Do not double type with Pharmacologic Substance, Antibiotic, or other chemical semantic types.', 'clnd', 1200, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141, 'Chemical', 4, 'Compounds or substances of definite molecular composition. Chemicals are viewed from two distinct perspectives in the network, functionally and structurally. Almost every chemical concept is assigned at least two types, generally one from the structure hierarchy and at least one from the function hierarchy.', 'Acids; Chemicals; Ionic Liquids', '{isa} Substance; {inverse_isa} Chemical Viewed Structurally; {inverse_isa} Chemical Viewed Functionally', 'A1.4.1', 'Few concepts will be assigned to this broad type.', 'chem', 1103, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 11411, 'Chemical Viewed Functionally', 5, 'A chemical viewed from the perspective of its functional characteristics or pharmacological activities.', 'Aerosol Propellants; Detergents; Stabilizing Agents', '{isa} Chemical; {inverse_isa} Pharmacologic Substance; {inverse_isa} Biomedical or Dental Material; {inverse_isa} Biologically Active Substance; {inverse_isa} Indicator, Reagent, or Diagnostic Aid; {inverse_isa} Hazardous or Poisonous Substance', 'A1.4.1.1', 'A specific chemical will not be assigned here. Groupings of chemicals viewed functionally, such as \\\"Aerosol Propellants\\\" may appropriately be assigned here. 
A name that is inherently functional, such as \\\"Food Additives\\\", will not also be assigned a type from the \\'Chemical Viewed Structurally\\' hierarchy.', 'chvf', 1120, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 11412, 'Chemical Viewed Structurally', 5, 'A chemical or chemicals viewed from the perspective of their structural characteristics. Included here are concepts which can mean either a salt, an ion, or a compound (e.g., \\\"Bromates\\\" and \\\"Bromides\\\").', 'Ammonium Compounds; Cations; Sulfur Compounds', '{isa} Chemical; {inverse_isa} Organic Chemical; {inverse_isa} Element, Ion, or Isotope; {inverse_isa} Inorganic Chemical', 'A1.4.1.2', 'Concepts are assigned to this type if they can be both organic and inorganic, e.g. sulfur compounds. Do not use this type if the concept has an important functional aspect, e.g., \\\"Mylanta Double Strength Liquid\\\" contains Al(OH)3, Mg(OH)2, and simethicone, but would be assigned only to \\'Pharmacologic Substance\\'.', 'chvs', 1104, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114111, 'Pharmacologic Substance', 6, 'A substance used in the treatment or prevention of pathologic disorders. This includes substances that occur naturally in the body and are administered therapeutically.', 'Antiemetics; Cardiovascular Agents; Alka-Seltzer', '{isa} Chemical Viewed Functionally; {inverse_isa} Antibiotic', 'A1.4.1.1.1', 'If a substance is both endogenous and typically used as a drug, then this type and the type \\'Biologically Active Substance\\' or one of its children are assigned. Body substances that are used therapeutically such as whole blood preparation, NOS would only receive the type \\'Body Substance\\'. 
Substances used in the diagnosis or analysis of normal and abnormal body functions should be given the type \\'Indicator, Reagent, or Diagnostic Aid\\'.', 'phsu', 1121, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114112, 'Biomedical or Dental Material', 6, 'A substance used in biomedicine or dentistry predominantly for its physical, as opposed to chemical, properties. Included here are biocompatible materials, tissue adhesives, bone cements, resins, toothpastes, etc.', 'Acrylic Resins; Bone Cements; Dentifrices', '{isa} Chemical Viewed Functionally', 'A1.4.1.1.2', NULL, 'bodm', 1122, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114113, 'Biologically Active Substance', 6, 'A generally endogenous substance produced or required by an organism, of primary interest because of its role in the biologic functioning of the organism that produces it.', 'Cytokinins; Pheromone', '{isa} Chemical Viewed Functionally; {inverse_isa} Hormone; {inverse_isa} Enzyme; {inverse_isa} Vitamin; {inverse_isa} Immunologic Factor; {inverse_isa} Receptor', 'A1.4.1.1.3', 'If a substance is both endogenous and typically used as a drug, then this type and the type \\'Pharmacologic Substance\\' are assigned.', 'bacs', 1123, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114114, 'Indicator, Reagent, or Diagnostic Aid', 6, 'A substance primarily of interest for its use in laboratory or diagnostic tests and procedures to detect, measure, examine, or analyze other chemicals, processes, or conditions.', 'Fluorescent Dyes; Indicators and Reagents; India ink stain', '{isa} Chemical Viewed Functionally', 'A1.4.1.1.4', 'Radioactive imaging agents should be assigned to this type and not to the type \\'Pharmacologic Substance\\' unless they are also being used therapeutically.', 'irda', 1130, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114115, 'Hazardous or Poisonous Substance', 6, 'A substance of concern because of its potentially hazardous or toxic effects. 
This would include most drugs of abuse, as well as agents that require special handling because of their toxicity.', 'Carcinogens; Fumigant; Mutagens', '{isa} Chemical Viewed Functionally', 'A1.4.1.1.5', 'Most pharmaceutical agents, although potentially harmful, are excluded here and are assigned to the type \\'Pharmacologic Substance\\'. All pesticides are assigned to this type.', 'hops', 1131, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114121, 'Organic Chemical', 6, 'The general class of carbon-containing compounds, usually based on carbon chains or rings, and also containing hydrogen (hydrocarbons), with or without nitrogen, oxygen, or other elements in which the bonding between elements is generally covalent.', 'Benzene Derivatives', '{isa} Chemical Viewed Structurally; {inverse_isa} Nucleic Acid, Nucleoside, or Nucleotide; {inverse_isa} Amino Acid, Peptide, or Protein', 'A1.4.1.2.1', 'Salts of organic chemicals (such as Calcium Acetate) would be considered organic chemicals and should not also receive the type \\'Inorganic Chemical\\'.', 'orch', 1109, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114122, 'Inorganic Chemical', 6, 'Chemical elements and their compounds, excluding the hydrocarbons and their derivatives (except carbides, carbonates, cyanides, cyanates and carbon disulfide). Generally inorganic compounds contain ionic bonds. Included here are inorganic acids and salts, alloys, alkalies, and minerals.', 'Carbonic Acid; aluminum nitride; ferric citrate', '{isa} Chemical Viewed Structurally', 'A1.4.1.2.2', NULL, 'inch', 1197, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 114123, 'Element, Ion, or Isotope', 6, 'One of the 109 presently known fundamental substances that comprise all matter at and above the atomic level. This includes elemental metals, rare gases, and most abundant naturally occurring radioactive elements, as well as the ionic counterparts of elements (NA+, Cl-), and the less abundant isotopic forms. 
This does not include organic ions such as iodoacetate to which the type \\'Organic Chemical\\' is assigned.', 'Carbon; Chromium Isotopes; Radioisotopes', '{isa} Chemical Viewed Structurally', 'A1.4.1.2.3', 'Group terms such as sulfates would be assigned to the type \\'Chemical Viewed Structurally\\'. Substances such as aluminum chloride would be assigned the type \\'Inorganic Chemical\\'. Technetium Tc 99m Aggregated Albumin would not receive this type.', 'elii', 1196, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141111, 'Antibiotic', 7, 'A pharmacologically active compound produced by growing microorganisms which kill or inhibit growth of other microorganisms.', 'Antibiotics; bactericide; Thienamycins', '{isa} Pharmacologic Substance', 'A1.4.1.1.1.1', NULL, 'antb', 1195, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141132, 'Hormone', 7, 'In animals, a chemical usually secreted by an endocrine gland whose products are released into the circulating fluid. Hormones act as chemical messengers and regulate various physiologic processes such as growth, reproduction, metabolism, etc. They usually fall into two broad classes, steroid hormones and peptide hormones.', 'Enteric Hormones; thymic humoral factor; Prohormone', '{isa} Biologically Active Substance', 'A1.4.1.1.3.2', 'Synthetic hormones that are used as drugs should receive this type and \\'Pharmacologic Substance\\'. Plant hormones are assigned only to the type \\'Pharmacologic Substance\\'.', 'horm', 1125, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141133, 'Enzyme', 7, 'A complex chemical, usually a protein, that is produced by living cells and which catalyzes specific biochemical reactions. 
There are six main types of enzymes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases.', 'GTP Cyclohydrolase II; enzyme substrate complex; arginine amidase', '{isa} Biologically Active Substance', 'A1.4.1.1.3.3', 'Generally when a concept is assigned to this type, it will also be assigned to the type \\'Amino Acid, Peptide, or Protein\\'.', 'enzy', 1126, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141134, 'Vitamin', 7, 'A substance, usually an organic chemical complex, present in natural products or made synthetically, which is essential in the diet of man or other higher animals. Included here are vitamin precursors, provitamins, and vitamin supplements.', '5,25-Dihydroxy cholecalciferol; alpha-tocopheryl oxalate; Vitamin A [EPC]', '{isa} Biologically Active Substance', 'A1.4.1.1.3.4', 'Essential amino acids are not assigned to this type. They will be assigned to the type \\'Amino Acid, Peptide, or Protein\\'. This can be used with \\'Pharmacologic Substance\\' if the compound is being administered therapeutically or if the source has it classified as therapeutic (i.e., N\\'ICE Sugarless Vitamin C Drops).', 'vita', 1127, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141135, 'Immunologic Factor', 7, 'A biologically active substance whose activities affect or play a role in the functioning of the immune system.', 'Antigens; Immunologic Factors; Blood group antigen P', '{isa} Biologically Active Substance', 'A1.4.1.1.3.5', 'Antigens and antibodies are assigned to this type. Unlike most biologically active substances, some immunologic factors may be exogenous. Vaccines should be given this type and the type \\'Pharmacologic Substance\\'.', 'imft', 1129, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141136, 'Receptor', 7, 'A specific structure or site on the cell surface or within its cytoplasm that recognizes and binds with other specific molecules. 
These include the proteins on the surface of an immunocompetent cell that binds with antigens, or proteins found on the surface molecules that bind with hormones or neurotransmitters and react with other molecules that respond in a specific way.', 'Binding Sites; Lymphocyte antigen CD4 receptor; integrin alpha11beta1', '{isa} Biologically Active Substance', 'A1.4.1.1.3.6', NULL, 'rcpt', 1192, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141215, 'Nucleic Acid, Nucleoside, or Nucleotide', 7, 'A complex compound of high molecular weight occurring in living cells. These are basically of two types, ribonucleic (RNA) and deoxyribonucleic (DNA) acids. Nucleic acids are made of nucleotides (nitrogen-containing base, a 5-carbon sugar, and one or more phosphate group) linked together by a phosphodiester bond between the 5\\' and 3\\' carbon atoms. Nucleosides are compounds composed of a purine or pyrimidine base (usually adenine, cytosine, guanine, thymine, uracil) linked to either a ribose or a deoxyribose sugar.', 'Cytosine Nucleotides; Guanine; Oligonucleotides', '{isa} Organic Chemical', 'A1.4.1.2.1.5', 'Naturally occurring nucleic acids, nucleosides, or nucleotides will also be assigned a type from the \\'Biologically Active Substance\\' hierarchy.', 'nnon', 1114, NULL, 'STY'),\n", + "(3, 'Chemicals and Drugs', 'CHEM', 1141217, 'Amino Acid, Peptide, or Protein', 7, 'Amino acids and chains of amino acids connected by peptide linkages.', 'Amino Acids, Cyclic; Glycopeptides; Keratin', '{isa} Organic Chemical', 'A1.4.1.2.1.7', 'When the concept is both an enzyme and a protein, this type and the type \\'Enzyme\\' will be assigned.', 'aapp', 1116, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 12, 'Conceptual Entity', 2, 'A broad type for grouping abstract entities or concepts.', 'Geographic Factors; Fractals; Secularism', '{isa} Entity; {inverse_isa} Organism Attribute; {inverse_isa} Finding; {inverse_isa} Idea or Concept; {inverse_isa} Occupation or 
Discipline; {inverse_isa} Organization; {inverse_isa} Group; {inverse_isa} Group Attribute; {inverse_isa} Intellectual Product; {inverse_isa} Language', 'A2', 'Few concepts will be assigned to this broad type.', 'cnce', 1077, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 121, 'Idea or Concept', 3, 'An abstract concept, such as a social, religious or philosophical concept.', 'Capitalism; Civil Rights; Ethics', '{isa} Conceptual Entity; {inverse_isa} Temporal Concept; {inverse_isa} Qualitative Concept; {inverse_isa} Quantitative Concept; {inverse_isa} Spatial Concept; {inverse_isa} Functional Concept', 'A2.1', NULL, 'idcn', 1078, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 124, 'Intellectual Product', 3, 'A conceptual entity resulting from human endeavor. Concepts assigned to this type generally refer to information created by humans for some purpose.', 'Decision Support Techniques; Information Systems; Literature', '{isa} Conceptual Entity; {inverse_isa} Regulation or Law; {inverse_isa} Classification', 'A2.4', 'Concepts referring to theorems, models, and systems are assigned here. In some cases, a concept may be assigned to both \\'Intellectual Product\\' and \\'Research Activity\\'. 
For example, the concept \\\"Comparative Study\\\" might be viewed as both an activity and the result, or product, of that activity.', 'inpr', 1170, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 125, 'Language', 3, 'The system of communication used by a particular nation or people.', 'Armenian language; braille; Bilingualism', '{isa} Conceptual Entity', 'A2.5', NULL, 'lang', 1171, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 128, 'Group Attribute', 3, 'A conceptual entity which refers to the frequency or distribution of certain characteristics or phenomena in certain groups.', 'Family Size; Group Structure; Life Expectancy', '{isa} Conceptual Entity', 'A2.8', NULL, 'grpa', 1102, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1211, 'Temporal Concept', 4, 'A concept which pertains to time or duration.', 'Birth Intervals; Half-Life; Postoperative Period', '{isa} Idea or Concept', 'A2.1.1', 'If the concept refers to a phase, stage, cycle, interval, period, or rhythm, it is assigned to this type.', 'tmco', 1079, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1212, 'Qualitative Concept', 4, 'A concept which is an assessment of some quality, rather than a direct measurement.', 'Clinical Competence; Consumer Satisfaction; Health Status', '{isa} Idea or Concept', 'A2.1.2', NULL, 'qlco', 1080, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1213, 'Quantitative Concept', 4, 'A concept which involves the dimensions, quantity or capacity of something using some unit of measure, or which involves the quantitative comparison of entities.', 'Age Distribution; Metric System; Selection Bias', '{isa} Idea or Concept', 'A2.1.3', 'If the concept refers to rate or distribution, the type \\'Temporal Concept\\' is not also assigned.', 'qnco', 1081, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1214, 'Functional Concept', 4, 'A concept which is of interest because it pertains to the carrying out of a process or activity.', 'Interviewer 
Effect; Problem Formulation; Endogenous', '{isa} Idea or Concept; {inverse_isa} Body System', 'A2.1.4', NULL, 'ftcn', 1169, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1215, 'Spatial Concept', 4, 'A location, region, or space, generally having definite boundaries.', 'Mandibular Rest Position; Lateral; Extrinsic', '{isa} Idea or Concept; {inverse_isa} Body Location or Region; {inverse_isa} Body Space or Junction; {inverse_isa} Geographic Area; {inverse_isa} Molecular Sequence', 'A2.1.5', NULL, 'spco', 1082, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1241, 'Classification', 4, 'A term or system of terms denoting an arrangement by class or category.', 'Anatomy (MeSH Category); Tumor Stage Classification; axis i', '{isa} Intellectual Product', 'A2.4.1', NULL, 'clas', 1185, NULL, 'STY'),\n", + "(4, 'Concepts and Ideas', 'CONC', 1242, 'Regulation or Law', 4, 'An intellectual product resulting from legislative or regulatory activity.', 'Building Codes; Criminal Law; Health Planning Guidelines', '{isa} Intellectual Product', 'A2.4.2', NULL, 'rnlw', 1089, NULL, 'STY'),\n", + "(5, 'Devices', 'DEVI', 1131, 'Medical Device', 4, 'A manufactured object used primarily in the diagnosis, treatment, or prevention of physiologic or anatomic disorders.', 'Bone Screws; Headgear, Orthodontic; Compression Stockings', '{isa} Manufactured Object; {inverse_isa} Drug Delivery Device', 'A1.3.1', 'A medical device may be used for research purposes, but since its primary use is for routine medical care, it is distinguished from a \\'Research Device\\' which is used primarily for research purposes.', 'medd', 1074, NULL, 'STY'),\n", + "(5, 'Devices', 'DEVI', 1132, 'Research Device', 4, 'A manufactured object used primarily in carrying out scientific research or experimentation.', 'Electrodes, Enzyme; DNA Microarray Chip; Particle Count and Size Analyzer', '{isa} Manufactured Object', 'A1.3.2', 'A research device is distinguished from a \\'Medical Device\\', which though it 
may be used for research purposes is used primarily for routine medical care.', 'resd', 1075, NULL, 'STY'),\n", + "(5, 'Devices', 'DEVI', 11311, 'Drug Delivery Device', 5, 'A medical device that contains a clinical drug or drugs.', 'Nordette 21 Day Pack; {7 (Terazosin 1 MG Oral Tablet) / 7 (Terazosin 2 MG Oral Tablet) } Pack; {10 (cefdinir 300 MG Oral Capsule [Omnicef]) } Pack [Omni-Pac]', '{isa} Medical Device', 'A1.3.1.1', NULL, 'drdd', 1203, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 122, 'Finding', 3, 'That which is discovered by direct observation or measurement of an organism attribute or condition, including the clinical history of the patient. The history of the presence of a disease is a \\'Finding\\' and is distinguished from the disease itself.', 'Birth History; Downward displacement of diaphragm; Decreased glucose level', '{isa} Conceptual Entity; {inverse_isa} Laboratory or Test Result; {inverse_isa} Sign or Symptom', 'A2.2', 'Only in rare circumstances will findings be double-typed with either \\'Pathologic Function\\' or \\'Anatomical Abnormality\\'. Most findings will be assigned the types \\'Laboratory or Test Result\\' or \\'Sign or Symptom\\'. 
Only those findings that relate to patient history or to the determination of a state will be assigned the type \\'Finding\\'.', 'fndg', 1033, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 223, 'Injury or Poisoning', 3, 'A traumatic wound, injury, or poisoning caused by an external agent or force.', 'Accidental Falls; Carbon Monoxide Poisoning; Snake Bites', '{isa} Phenomenon or Process', 'B2.3', 'An `Injury or Poisoning\\' is distinguished from a \\'Disease or Syndrome\\' that may be a result of prolonged exposure to toxic materials.', 'inpo', 1037, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 1122, 'Anatomical Abnormality', 4, 'An abnormal structure, or one that is abnormal in size or location.', 'Bronchial Fistula; Foot Deformities; Hyperostosis of skull', '{isa} Anatomical Structure; {inverse_isa} Congenital Abnormality; {inverse_isa} Acquired Abnormality', 'A1.2.2', 'Use this type if the abnormality in question can be either an acquired or congenital abnormality. Neoplasms are not included here. These are given the type \\'Neoplastic Process\\'. 
If an anatomical abnormality has a pathologic manifestation, then it will additionally be given the type \\'Disease or Syndrome\\', e.g., \\\"Diabetic Cataract\\\" will be double-typed for this reason.', 'anab', 1190, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 1222, 'Sign or Symptom', 4, 'An observable manifestation of a disease or condition based on clinical judgment, or a manifestation of a disease or condition which is experienced by the patient and reported as a subjective observation.', 'Dyspnea; Nausea; Pain', '{isa} Finding', 'A2.2.2', NULL, 'sosy', 1184, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 11221, 'Congenital Abnormality', 5, 'An abnormal structure, or one that is abnormal in size or location, present at birth or evolving over time as a result of a defect in embryogenesis.', 'Albinism; Cleft palate with cleft lip; Polydactyly of toes', '{isa} Anatomical Abnormality', 'A1.2.2.1', 'If the congenital abnormality involves multiple defects then the type \\'Disease or Syndrome\\' will also be assigned.', 'cgab', 1019, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 11222, 'Acquired Abnormality', 5, 'An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., \\\"hernias incarcerate\\\").', 'Hemorrhoids; Hernia, Femoral; Cauliflower ear', '{isa} Anatomical Abnormality', 'A1.2.2.2', NULL, 'acab', 1020, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 22212, 'Pathologic Function', 5, 'A disordered process, activity, or state of the organism as a whole, of a body system or systems, or of multiple organs or tissues. Included here are normal responses to a negative stimulus as well as patholologic conditions or states that are less specific than a disease. 
Pathologic functions frequently have systemic effects.', 'Inflammation; Shock; Thrombosis', '{isa} Biologic Function; {inverse_isa} Disease or Syndrome; {inverse_isa} Cell or Molecular Dysfunction; {inverse_isa} Experimental Model of Disease', 'B2.2.1.2', 'If the process is specific, for example to a site or substance, then \\'Disease or Syndrome\\' will be assigned and not \\'Pathologic Function\\'. For example, \\\"cerebral anoxia\\\", \\\"brain edema\\\", and \\\"milk hypersensitivity\\\" will all be assigned to \\'Disease or Syndrome\\' only.', 'patf', 1046, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 222121, 'Disease or Syndrome', 6, 'A condition which alters or interferes with a normal process, state, or activity of an organism. It is usually characterized by the abnormal functioning of one or more of the host\\'s systems, parts, or organs. Included here is a complex of symptoms descriptive of a disorder.', 'Diabetes Mellitus; Drug Allergy; Malabsorption Syndrome', '{isa} Pathologic Function; {inverse_isa} Mental or Behavioral Dysfunction; {inverse_isa} Neoplastic Process', 'B2.2.1.2.1', 'Any specific disease or syndrome that is modified by such modifiers as \\\"acute\\\", \\\"prolonged\\\", etc. will also be assigned to this type. 
If an anatomic abnormality has a pathologic manifestation, then it will be given this type as well as a type from the \\'Anatomical Abnormality\\' hierarchy, e.g., \\\"Diabetic Cataract\\\" will be double-typed for this reason.', 'dsyn', 1047, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 222122, 'Cell or Molecular Dysfunction', 6, 'A pathologic function inherent to cells, parts of cells, or molecules.', 'DNA Damage; Wallerian Degeneration; Atypical squamous metaplasia', '{isa} Pathologic Function', 'B2.2.1.2.2', 'This is not intended to be a repository for diseases whose molecular basis has been established.', 'comd', 1049, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 222123, 'Experimental Model of Disease', 6, 'A representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment.', 'Alloxan Diabetes; Liver Cirrhosis, Experimental; Transient Gene Knock-Out Model', '{isa} Pathologic Function', 'B2.2.1.2.3', NULL, 'emod', 1050, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 2221211, 'Mental or Behavioral Dysfunction', 7, 'A clinically significant dysfunction whose major manifestation is behavioral or psychological. These dysfunctions may have identified or presumed biological etiologies or manifestations.', 'Agoraphobia; Cyclothymic Disorder; Frigidity', '{isa} Disease or Syndrome', 'B2.2.1.2.1.1', NULL, 'mobd', 1048, NULL, 'STY'),\n", + "(6, 'Disorders', 'DISO', 2221212, 'Neoplastic Process', 7, 'A new and abnormal growth of tissue in which the growth is uncontrolled and progressive. The growths may be malignant or benign.', 'Abdominal Neoplasms; Bowen\\'s Disease; Polyp in nasopharynx', '{isa} Disease or Syndrome', 'B2.2.1.2.1.2', 'All neoplasms are assigned to this type. 
Do not also assign a type from the \\'Anatomical Abnormality\\' hierarchy.', 'neop', 1191, NULL, 'STY'),\n", + "(7, 'Genes and Molecular Sequences', 'GENE', 11235, 'Gene or Genome', 5, 'A specific sequence, or in the case of the genome the complete sequence, of nucleotides along a molecule of DNA or RNA (in the case of some viruses) which represent the functional units of heredity.', 'Alleles; Genome, Human; rRNA Operon', '{isa} Fully Formed Anatomical Structure', 'A1.2.3.5', NULL, 'gngm', 1028, NULL, 'STY'),\n", + "(7, 'Genes and Molecular Sequences', 'GENE', 12153, 'Molecular Sequence', 5, 'A broad type for grouping the collected sequences of amino acids, carbohydrates, and nucleotide sequences. Descriptions of these sequences are generally reported in the published literature and/or are deposited in and maintained by databanks such as GenBank, European Molecular Biology Laboratory (EMBL), National Biomedical Research Foundation (NBRF), or other sequence repositories.', 'Genetic Code; Homologous Sequences; Molecular Sequence', '{isa} Spatial Concept; {inverse_isa} Nucleotide Sequence; {inverse_isa} Amino Acid Sequence; {inverse_isa} Carbohydrate Sequence', 'A2.1.5.3', NULL, 'mosq', 1085, NULL, 'STY'),\n", + "(7, 'Genes and Molecular Sequences', 'GENE', 121531, 'Nucleotide Sequence', 6, 'The sequence of purines and pyrimidines in nucleic acids and polynucleotides. Included here are nucleotide-rich regions, conserved sequence, and DNA transforming region.', 'Base Sequence; Direct Repeat; RNA Sequence', '{isa} Molecular Sequence', 'A2.1.5.3.1', NULL, 'nusq', 1086, NULL, 'STY'),\n", + "(7, 'Genes and Molecular Sequences', 'GENE', 121532, 'Amino Acid Sequence', 6, 'The sequence of amino acids as arrayed in chains, sheets, etc., within the protein molecule. 
It is of fundamental importance in determining protein structure.', 'Signal Peptides; Homologous Sequences, Amino Acid; Abnormal amino acid sequence', '{isa} Molecular Sequence', 'A2.1.5.3.2', NULL, 'amas', 1087, NULL, 'STY'),\n", + "(7, 'Genes and Molecular Sequences', 'GENE', 121533, 'Carbohydrate Sequence', 6, 'The sequence of carbohydrates within polysaccharides, glycoproteins, and glycolipids.', 'Carbohydrate Sequence; Abnormal carbohydrate sequence', '{isa} Molecular Sequence', 'A2.1.5.3.3', NULL, 'crbs', 1088, NULL, 'STY'),\n", + "(8, 'Geographic Areas', 'GEOG', 12154, 'Geographic Area', 5, 'A geographic location, generally having definite boundaries.', 'Baltimore; Canada; Far East', '{isa} Spatial Concept', 'A2.1.5.4', NULL, 'geoa', 1083, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 111, 'Organism', 3, 'Generally, a living individual, including all plants and animals.', 'Organism; Infectious agent; Heterotroph', '{isa} Physical Object; {inverse_isa} Virus; {inverse_isa} Bacterium; {inverse_isa} Archaeon; {inverse_isa} Eukaryote', 'A1.1', NULL, 'orgm', 1001, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 129, 'Group', 3, 'A conceptual entity referring to the classification of individuals according to certain shared characteristics.', 'Focus Groups; jury; teams', '{isa} Conceptual Entity; {inverse_isa} Professional or Occupational Group; {inverse_isa} Population Group; {inverse_isa} Family Group; {inverse_isa} Age Group; {inverse_isa} Patient or Disabled Group', 'A2.9', 'Few concepts will be assigned to this broad type.', 'grup', 1096, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1111, 'Archaeon', 4, 'A member of one of the three domains of life, formerly called Archaebacteria under the taxon Bacteria, but now considered separate and distinct. 
Archaea are characterized by: 1) the presence of characteristic tRNAs and ribosomal RNAs; 2) the absence of peptidoglycan cell walls; 3) the presence of ether-linked lipids built from branched-chain subunits; and 4) their occurrence in unusual habitats. While archaea resemble bacteria in morphology and genomic organization, they resemble eukarya in their method of genomic replication.', 'Thermoproteales; Haloferax volcanii; Methanospirillum', '{isa} Organism', 'A1.1.1', NULL, 'arch', 1194, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1112, 'Bacterium', 4, 'A small, typically one-celled, prokaryotic micro-organism.', 'Acetobacter; Bacillus cereus; Cytophaga', '{isa} Organism', 'A1.1.2', NULL, 'bact', 1007, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113, 'Eukaryote', 4, 'One of the three domains of life (the others being Bacteria and Archaea), also called Eukarya. These are organisms whose cells are enclosed in membranes and possess a nucleus. They comprise almost all multicellular and many unicellular organisms, and are traditionally divided into groups (sometimes called kingdoms) including Animals, Plants, Fungi, various Algae, and other taxa that were previously part of the old kingdom Protista.', 'Order Acarina; Bees; Plasmodium malariae', '{isa} Organism; {inverse_isa} Plant; {inverse_isa} Fungus; {inverse_isa} Animal', 'A1.1.3', NULL, 'euka', 1204, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1114, 'Virus', 4, 'An organism consisting of a core of a single nucleic acid enclosed in a protective coat of protein. A virus may replicate only inside a host living cell. 
A virus exhibits some but not all of the usual characteristics of living things.', 'Coliphages; Echovirus; Parvoviridae', '{isa} Organism', 'A1.1.4', NULL, 'virs', 1005, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1291, 'Professional or Occupational Group', 4, 'An individual or individuals classified according to their vocation.', 'Clergy; Demographers; Hospital Volunteers', '{isa} Group', 'A2.9.1', 'If the concept refers to the discipline or vocation itself, rather than to the individuals who have the vocation, then the type \\'Occupation or Discipline\\' will be assigned instead.', 'prog', 1097, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1292, 'Population Group', 4, 'An indivdual or individuals classified according to their sex, racial origin, religion, common place of living, financial or social status, or some other cultural or behavioral attribute.', 'Asian Americans; Ethnic group; Adult Offenders', '{isa} Group', 'A2.9.2', NULL, 'popg', 1098, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1293, 'Family Group', 4, 'An individual or individuals classified according to their family relationships or relative position in the family unit.', 'Daughter; Is an only child; Unmarried Fathers', '{isa} Group', 'A2.9.3', NULL, 'famg', 1099, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1294, 'Age Group', 4, 'An individual or individuals classified according to their age.', 'Adult; Infant, Premature; Adolescent (age group)', '{isa} Group', 'A2.9.4', NULL, 'aggp', 1100, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1295, 'Patient or Disabled Group', 4, 'An individual or individuals classified according to a disability, disease, condition or treatment.', 'Amputees; Institutionalized Child; Mentally Ill Persons', '{isa} Group', 'A2.9.5', NULL, 'podg', 1101, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 11131, 'Animal', 5, 'An organism with eukaryotic cells, and lacking stiff cell walls, plastids and photosynthetic pigments.', 'Animals; Animals, 
Laboratory; Carnivore', '{isa} Eukaryote; {inverse_isa} Vertebrate', 'A1.1.3.1', NULL, 'anim', 1008, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 11132, 'Fungus', 5, 'A eukaryotic organism characterized by the absence of chlorophyll and the presence of a rigid cell wall. Included here are both slime molds and true fungi such as yeasts, molds, mildews, and mushrooms.', 'Aspergillus clavatus; Blastomyces; Neurospora', '{isa} Eukaryote', 'A1.1.3.2', NULL, 'fngs', 1004, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 11133, 'Plant', 5, 'An organism having cellulose cell walls, growing by synthesis of inorganic substances, generally distinguished by the presence of chlorophyll, and lacking the power of locomotion. Plant parts are included here as well.', 'Aloe; Pollen; Helianthus species', '{isa} Eukaryote', 'A1.1.3.3', NULL, 'plnt', 1002, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 111311, 'Vertebrate', 6, 'An animal which has a spinal column.', 'Vertebrates; Gnathostomata vertebrate; Craniata ', '{isa} Animal; {inverse_isa} Amphibian; {inverse_isa} Bird; {inverse_isa} Fish; {inverse_isa} Reptile; {inverse_isa} Mammal', 'A1.1.3.1.1', 'Few concepts will be assigned to this broad type.', 'vtbt', 1010, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113111, 'Amphibian', 7, 'A cold-blooded, smooth-skinned vertebrate which characteristically hatches as an aquatic larva, breathing by gills. When mature, the amphibian breathes with lungs.', 'Salamandra; Urodela; Brazilian horned frog', '{isa} Vertebrate', 'A1.1.3.1.1.1', NULL, 'amph', 1011, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113112, 'Bird', 7, 'A vertebrate having a constant body temperature and characterized by the presence of feathers.', 'Serinus; Ducks; Quail', '{isa} Vertebrate', 'A1.1.3.1.1.2', NULL, 'bird', 1012, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113113, 'Fish', 7, 'A cold-blooded aquatic vertebrate characterized by fins and breathing by gills. 
Included here are fishes having either a bony skeleton, such as a perch, or a cartilaginous skeleton, such as a shark, or those lacking a jaw, such as a lamprey or hagfish.', 'Bass; Salmonidae; Whitefish', '{isa} Vertebrate', 'A1.1.3.1.1.3', NULL, 'fish', 1013, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113114, 'Mammal', 7, 'A vertebrate having a constant body temperature and characterized by the presence of hair, mammary glands and sweat glands.', 'Ursidae Family; Hamsters; Macaca', '{isa} Vertebrate; {inverse_isa} Human', 'A1.1.3.1.1.4', NULL, 'mamm', 1015, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 1113115, 'Reptile', 7, 'A cold-blooded vertebrate having an external covering of scales or horny plates. Reptiles breathe by means of lungs and are generally egg-laying.', 'Alligators; Water Mocassin; Genus Python (organism)', '{isa} Vertebrate', 'A1.1.3.1.1.5', NULL, 'rept', 1014, NULL, 'STY'),\n", + "(9, 'Living Beings', 'LIVE', 11131141, 'Human', 8, 'Modern man, the only remaining species of the Homo genus.', 'Homo sapiens; jean piaget; Member of public', '{isa} Mammal', 'A1.1.3.1.1.4.1', 'If a concept describes a human being from the point of view of occupational, family, social status, etc., then a type from the \\'Group\\' hierarchy will be assigned instead.', 'humn', 1016, NULL, 'STY'),\n", + "(10, 'Objects', 'OBJC', 1, 'Entity', 1, 'A broad type for grouping physical and conceptual entities.', 'Gifts, Financial; Image; Product Part', '{inverse_isa} Physical Object; {inverse_isa} Conceptual Entity', 'A', 'Few concepts will be assigned to this broad type.', 'enty', 1071, NULL, 'STY'),\n", + "(10, 'Objects', 'OBJC', 11, 'Physical Object', 2, 'An object perceptible to the sense of vision or touch.', 'Printed Media; Meteors; Physical object', '{isa} Entity; {inverse_isa} Organism; {inverse_isa} Anatomical Structure; {inverse_isa} Manufactured Object; {inverse_isa} Substance', 'A1', NULL, 'phob', 1072, NULL, 'STY'),\n", + "(10, 'Objects', 'OBJC', 113, 
'Manufactured Object', 3, 'A physical object made by human beings.', 'car seat; Cooking and Eating Utensils; Goggles', '{isa} Physical Object; {inverse_isa} Medical Device; {inverse_isa} Research Device; {inverse_isa} Clinical Drug', 'A1.3', NULL, 'mnob', 1073, NULL, 'STY'),\n", + "(10, 'Objects', 'OBJC', 114, 'Substance', 3, 'A material with definite or fairly definite chemical composition.', 'Air (substance); Fossils; Plastics', '{isa} Physical Object; {inverse_isa} Body Substance; {inverse_isa} Chemical; {inverse_isa} Food', 'A1.4', NULL, 'sbst', 1167, NULL, 'STY'),\n", + "(10, 'Objects', 'OBJC', 1143, 'Food', 4, 'Any substance generally containing nutrients, such as carbohydrates, proteins, and fats, that can be ingested by a living organism and metabolized into energy and body tissue. Some foods are naturally occurring, others are either partially or entirely made by humans.', 'Beverages; Egg Yolk (Dietary); Ice Cream', '{isa} Substance', 'A1.4.3', 'Food additives, food preservatives, and food dyes should be given the type \\'Chemical Viewed Functionally\\'; \\\"Diet Coke\\\" would be assigned this type.', 'food', 1168, NULL, 'STY'),\n", + "(11, 'Occupations', 'OCCU', 126, 'Occupation or Discipline', 3, 'A vocation, academic discipline, or field of study, or a subpart of an occupation or discipline.', 'Aviation; Craniology; Ecology', '{isa} Conceptual Entity; {inverse_isa} Biomedical Occupation or Discipline', 'A2.6', 'If the concept refers to the individuals who have the vocation, the type \\'Professional or Occupational Group\\' will be assigned instead.', 'ocdi', 1090, NULL, 'STY'),\n", + "(11, 'Occupations', 'OCCU', 1261, 'Biomedical Occupation or Discipline', 4, 'A vocation, academic discipline, or field of study related to biomedicine.', 'Adolescent Medicine; Cellular Neurobiology; Dentistry', '{isa} Occupation or Discipline', 'A2.6.1', NULL, 'bmod', 1091, NULL, 'STY'),\n", + "(12, 'Organizations', 'ORGA', 127, 'Organization', 3, 'The result of uniting 
for a common purpose or function. The continued existence of an organization is not dependent on any of its members, its location, or particular facility. Components or subparts of organizations are also included here. Although the names of organizations are sometimes used to refer to the buildings in which they reside, they are not inherently physical in nature.', 'Labor Unions; United Nations; Boarding school', '{isa} Conceptual Entity; {inverse_isa} Health Care Related Organization; {inverse_isa} Professional Society; {inverse_isa} Self-help or Relief Organization', 'A2.7', NULL, 'orgt', 1092, NULL, 'STY'),\n", + "(12, 'Organizations', 'ORGA', 1271, 'Health Care Related Organization', 4, 'An established organization which carries out specific functions related to health care delivery or research in the life sciences.', 'Centers for Disease Control and Prevention (U.S.); Halfway Houses; Hospitals, Pediatric', '{isa} Organization', 'A2.7.1', 'Concepts for health care related professional societies are assigned the type \\'Professional Society\\'.', 'hcro', 1093, NULL, 'STY'),\n", + "(12, 'Organizations', 'ORGA', 1272, 'Professional Society', 4, 'An organization uniting those who have a common vocation or who are involved with a common field of study.', 'American Medical Association; International Council of Nurses; Library', '{isa} Organization', 'A2.7.2', NULL, 'pros', 1094, NULL, 'STY'),\n", + "(12, 'Organizations', 'ORGA', 1273, 'Self-help or Relief Organization', 4, 'An organization whose purpose and function is to provide assistance to the needy or to offer support to those sharing similar problems.', 'Alcoholics Anonymous; Charities - organization; Red Cross', '{isa} Organization', 'A2.7.3', NULL, 'shro', 1095, NULL, 'STY'),\n", + "(13, 'Phenomena', 'PHEN', 22, 'Phenomenon or Process', 2, 'A process or state which occurs naturally or as a result of an activity.', 'Disasters; Motor Traffic Accidents; Depolymerization', '{isa} Event; {inverse_isa} Injury or 
Poisoning; {inverse_isa} Human-caused Phenomenon or Process; {inverse_isa} Natural Phenomenon or Process', 'B2', NULL, 'phpr', 1067, NULL, 'STY'),\n", + "(13, 'Phenomena', 'PHEN', 221, 'Human-caused Phenomenon or Process', 3, 'A phenomenon or process that is a result of the activities of human beings.', 'Baby Boom; Cultural Evolution; Mass Media', '{isa} Phenomenon or Process; {inverse_isa} Environmental Effect of Humans', 'B2.1', 'If the concept refers to the activity itself, rather than the result of that activity, a type from the \\'Activity\\' hierarchy will be assigned instead.', 'hcpp', 1068, NULL, 'STY'),\n", + "(13, 'Phenomena', 'PHEN', 222, 'Natural Phenomenon or Process', 3, 'A phenomenon or process that occurs irrespective of the activities of human beings.', 'Air Movements; Corrosion; Lightning (phenomenon)', '{isa} Phenomenon or Process; {inverse_isa} Biologic Function', 'B2.2', NULL, 'npop', 1070, NULL, 'STY'),\n", + "(13, 'Phenomena', 'PHEN', 1221, 'Laboratory or Test Result', 4, 'The outcome of a specific test to measure an attribute or to determine the presence, absence, or degree of a condition.', 'Blood Flow Velocity; Serum Calcium Level; Spinal Fluid Pressure', '{isa} Finding', 'A2.2.1', 'Laboratory or test results are considered inherently quantitative and, thus, are not assigned the additional type \\'Quantitative Concept\\'.', 'lbtr', 1034, NULL, 'STY');\n", + "INSERT INTO `semantic_network` (`SemanticGroupCode`, `SemanticGroup`, `SemanticGroupAbr`, `CustomTreeNumber`, `SemanticTypeName`, `BranchPosition`, `Definition`, `Examples`, `RelationName`, `SemTypeTreeNo`, `UsageNote`, `Abbreviation`, `UniqueID`, `NonHumanFlag`, `RecordType`) VALUES\n", + "(13, 'Phenomena', 'PHEN', 2211, 'Environmental Effect of Humans', 4, 'A change in the natural environment that is a result of the activities of human beings.', 'Air Pollution; Desertification; Bioremediation', '{isa} Human-caused Phenomenon or Process', 'B2.1.1', NULL, 'eehu', 1069, NULL, 
'STY'),\n", + "(13, 'Phenomena', 'PHEN', 2221, 'Biologic Function', 4, 'A state, activity or process of the body or one of its systems or parts.', 'Antibody Formation; Drug resistance; Homeostasis', '{isa} Natural Phenomenon or Process; {inverse_isa} Physiologic Function; {inverse_isa} Pathologic Function', 'B2.2.1', 'Few concepts will be assigned to this broad type.', 'biof', 1038, 'Y', 'STY'),\n", + "(14, 'Physiology', 'PHYS', 123, 'Organism Attribute', 3, 'A property of the organism or its major parts.', 'Age; Birth Weight; Eye Color', '{isa} Conceptual Entity; {inverse_isa} Clinical Attribute', 'A2.3', NULL, 'orga', 1032, 'Y', 'STY'),\n", + "(14, 'Physiology', 'PHYS', 1231, 'Clinical Attribute', 4, 'An observable or measurable property or state of an organism of clinical interest.', 'Bone Density; heart rate; Range of Motion, Articular', '{isa} Organism Attribute', 'A2.3.1', 'These are the attributes that are being evaluated or measured, not the results of the evaluation.', 'clna', 1201, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 22211, 'Physiologic Function', 5, 'A normal process, activity, or state of the body.', 'Biorhythms; Hearing; Vasodilation', '{isa} Biologic Function; {inverse_isa} Organism Function; {inverse_isa} Organ or Tissue Function; {inverse_isa} Cell Function; {inverse_isa} Molecular Function', 'B2.2.1.1', NULL, 'phsf', 1039, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 222111, 'Organism Function', 6, 'A physiologic function of the organism as a whole, of multiple organ systems, or of multiple organs or tissues.', 'Breeding; Hibernation; Motor Skills', '{isa} Physiologic Function; {inverse_isa} Mental Process', 'B2.2.1.1.1', NULL, 'orgf', 1040, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 222112, 'Organ or Tissue Function', 6, 'A physiologic function of a particular organ, organ system, or tissue.', 'Osteogenesis; Renal Circulation; Tooth Calcification', '{isa} Physiologic Function', 'B2.2.1.1.2', NULL, 'ortf', 1042, NULL, 
'STY'),\n", + "(14, 'Physiology', 'PHYS', 222113, 'Cell Function', 6, 'A physiologic function inherent to cells or cell components.', 'Cell Cycle; Cell division; Phagocytosis', '{isa} Physiologic Function', 'B2.2.1.1.3', NULL, 'celf', 1043, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 222114, 'Molecular Function', 6, 'A physiologic function occurring at the molecular level.', 'Binding, Competitive; Electron Transport; Glycolysis', '{isa} Physiologic Function; {inverse_isa} Genetic Function', 'B2.2.1.1.4', NULL, 'moft', 1044, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 2221111, 'Mental Process', 7, 'A physiologic function involving the mind or cognitive processing.', 'Anger; Auditory Fatigue; Avoidance Learning', '{isa} Organism Function', 'B2.2.1.1.1.1', NULL, 'menp', 1041, NULL, 'STY'),\n", + "(14, 'Physiology', 'PHYS', 2221141, 'Genetic Function', 7, 'Functions of or related to the maintenance, translation or expression of the genetic material.', 'Early Gene Transcription; Gene Amplification; RNA Splicing', '{isa} Molecular Function', 'B2.2.1.1.4.1', NULL, 'genf', 1045, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 2131, 'Health Care Activity', 4, 'An activity of or relating to the practice of medicine or involving the care of patients.', 'ambulatory care services; Clinic Activities; Preventive Health Services', '{isa} Occupational Activity; {inverse_isa} Laboratory Procedure; {inverse_isa} Diagnostic Procedure; {inverse_isa} Therapeutic or Preventive Procedure', 'B1.3.1', NULL, 'hlca', 1058, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 2132, 'Research Activity', 4, 'An activity carried out as part of research or experimentation.', 'Animal Experimentation; Biomedical Research; Experimental Replication', '{isa} Occupational Activity; {inverse_isa} Molecular Biology Research Technique', 'B1.3.2', 'In some cases, a concept may be assigned to both this type and the type \\'Intellectual Product\\'. 
For example, the concept \\\"Comparative Study\\\" might be viewed as both an activity and the result, or product, of that activity.', 'resa', 1062, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 2134, 'Educational Activity', 4, 'An activity related to the organization and provision of education.', 'Academic Training; Family Planning Training; Preceptorship', '{isa} Occupational Activity', 'B1.3.4', NULL, 'edac', 1065, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 21311, 'Laboratory Procedure', 5, 'A procedure, method, or technique used to determine the composition, quantity, or concentration of a specimen, and which is carried out in a clinical laboratory. Included here are procedures which measure the times and rates of reactions.', 'Blood Protein Electrophoresis; Crystallography; Radioimmunoassay', '{isa} Health Care Activity', 'B1.3.1.1', NULL, 'lbpr', 1059, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 21312, 'Diagnostic Procedure', 5, 'A procedure, method, or technique used to determine the nature or identity of a disease or disorder. 
This excludes procedures which are primarily carried out on specimens in a laboratory.', 'Biopsy; Heart Auscultation; Magnetic Resonance Imaging', '{isa} Health Care Activity', 'B1.3.1.2', NULL, 'diap', 1060, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 21313, 'Therapeutic or Preventive Procedure', 5, 'A procedure, method, or technique designed to prevent a disease or a disorder, or to improve physical function, or used in the process of treating a disease or injury.', 'Cesarean section; Dermabrasion; Family psychotherapy', '{isa} Health Care Activity', 'B1.3.1.3', NULL, 'topp', 1061, NULL, 'STY'),\n", + "(15, 'Procedures', 'PROC', 21321, 'Molecular Biology Research Technique', 5, 'Any of the techniques used in the study of or the directed modification of the gene complement of a living organism.', 'Northern Blotting; Genetic Engineering; In Situ Hybridization', '{isa} Research Activity', 'B1.3.2.1', NULL, 'mbrt', 1063, NULL, 'STY');\n", + "'''" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/07_UI_building.ipynb b/07_UI_building.ipynb new file mode 100644 index 0000000..88aa041 --- /dev/null +++ b/07_UI_building.ipynb @@ -0,0 +1,135 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 7. UI building\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** Build UI information
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "## Script contents\n", + "\n", + "\n", + "## FIXMEs\n", + "\n", + "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n", + "\n", + "* [ ] \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 1. Start-up / What to put into place, where\n", + "# ============================================\n", + "\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import os\n", + "\n", + "''' 100-percent content inventory from SEO Spider or other. Our allPages \n", + " dataframe is based on a 100-percent content inventory, so we \n", + " can analyze pages with zero traffic or zero searches. Also includes\n", + " the page title, date the page was last updated - lots of rich info.\n", + "- Summary stats by communication package, from content inventory.'''\n", + "contentInventoryFileName = '00 SourceFiles/page.csv'\n", + "packageSummaryFileName = '00 SourceFiles/group.csv'\n", + "\n", + "''' Traffic log. This script assumes Google Analytics unsampled report; \n", + "references two column names: Page and Unique Pageviews. I export \n", + "report header so I'll know later what is in the file, which means my \n", + "import command skips the first ~6 rows.'''\n", + "newTrafficFileName = '00 SourceFiles/Pages_Q2.csv'\n", + "\n", + "'''\n", + "The following custom dictionary files need to be in place in /01/Pre-process\n", + "\n", + "GoldStandard.csv - Already-assigned term list, from UMLS and other sources, \n", + " vetted.\n", + "NamedEntities.csv - Known entities such as person names, product names, acronyms, \n", + " abbreviations, org parts, etc. 
Will overlap with GoldStandard; however, \n", + " UPDATE THIS FILE and this will replicate over to GoldStandard.\n", + "MisspelledOrForeign.csv - Short list of frequently misspelled words with HIGH\n", + " confidence that they can be replaced without review. Okay to include\n", + " foreign words.\n", + "'''\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Plots" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Pie for percentage of rows assigned; https://pythonspot.com/matplotlib-pie-chart/\n", + "totCount = len(logWithGoldStandard)\n", + "unassigned = logWithGoldStandard['SemanticGroup'].isnull().sum()\n", + "assigned = totCount - unassigned\n", + "labels = ['Assigned', 'Unassigned']\n", + "sizes = [assigned, unassigned]\n", + "colors = ['lightskyblue', 'lightcoral']\n", + "explode = (0.1, 0) # explode 1st slice\n", + "plt.pie(sizes, explode=explode, labels=labels, colors=colors,\n", + " autopct='%1.f%%', shadow=True, startangle=100)\n", + "plt.axis('equal')\n", + "plt.title(\"Status after 'GoldStandard' processing\")\n", + "plt.show()\n", + "\n", + "\n", + "# Bar of SemanticGroup categories, horizontal\n", + "# Source: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html\n", + "ax = logWithGoldStandard['SemanticGroup'].value_counts().plot(kind='barh', figsize=(10,6),\n", + " color=\"slateblue\", fontsize=10);\n", + "ax.set_alpha(0.8)\n", + "ax.set_title(\"Categories assigned after 'GoldStandard' processing\", fontsize=14)\n", + "ax.set_xlabel(\"Number of searches\", fontsize=9);\n", + "# set individual bar lables using above list\n", + "for i in ax.patches:\n", + " # get_width pulls left or right; get_y pushes up or down\n", + " ax.text(i.get_width()+.1, i.get_y()+.31, \\\n", + " str(round((i.get_width()), 2)), fontsize=9, color='dimgrey')\n", + "# invert for largest on top \n", + "ax.invert_yaxis()\n", + 
"plt.gcf().subplots_adjust(left=0.3)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/08_Misc_fixes.ipynb b/08_Misc_fixes.ipynb new file mode 100644 index 0000000..a0c50ee --- /dev/null +++ b/08_Misc_fixes.ipynb @@ -0,0 +1,306 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Part 8. Misc fixes\n", + "App to analyze web-site search logs (internal search)
\n", + "**This script:** Re-usable code that doesn't belong anywhere in particular
\n", + "Authors: dan.wendling@nih.gov,
\n", + "Last modified: 2018-09-09\n", + "\n", + "## Script contents\n", + "\n", + "\n", + "## FIXMEs\n", + "\n", + "Things Dan wrote for Dan; modify as needed. There are more FIXMEs in context.\n", + "\n", + "* [ ] \n", + "\n", + "\n", + "Found this useful: https://stackoverflow.com/questions/tagged/matplotlib\n", + "\n", + "## 1. Start-up / What to put into place, where\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import os\n", + "\n", + "# Set working directory\n", + "os.chdir('/Users/wendlingd/Projects/webDS/_util')\n", + "\n", + "localDir = '08_Misc_fixes/' # Different than others, see about changing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 2. Load and clean logAfterUmlsApi1\n", + "# ===================================\n", + "\n", + "logAfterUmlsApi1 = pd.read_excel('02_Run_APIs_files/logAfterUmlsApi1.xlsx')\n", + "\n", + "logAfterUmlsApi1.loc[logAfterUmlsApi1['preferredTerm'].str.contains('^BLAST (physical force)', na=False), 'preferredTerm'] = 'Bibliographic Entity'\n", + "\n", + "\n", + "logAfterUmlsApi1['preferredTerm'] = logAfterUmlsApi1['preferredTerm'].str.replace(\"^BLAST \\(physical force\\)\", \"^BLAST$\", regex=True)\n", + "\n", + "logAfterUmlsApi1['preferredTerm'] = logAfterUmlsApi1['preferredTerm'].str.replace(\"BLAST Link\", \"BLAST\", regex=False)\n", + "\n", + "huh = logAfterUmlsApi1[logAfterUmlsApi1.adjustedQueryCase.str.startswith(\"blast\") == True] # retrieve records to eyeball\n", + "huh = huh.groupby('preferredTerm').size()\n", + "\n", + "logAfterUmlsApi1['preferredTerm'] = logAfterUmlsApi1['preferredTerm'].str.replace('Bibliographic Reference', 'Bibliographic Entity')\n", + "\n", + "logAfterUmlsApi1['preferredTerm'] = 
logAfterUmlsApi1['preferredTerm'].str.replace('Mesh surgical material', 'MeSH')\n", + "\n", + "# Write out the fixed file\n", + "writer = pd.ExcelWriter('02_Run_APIs_files/logAfterUmlsApi1.xlsx')\n", + "logAfterUmlsApi1.to_excel(writer,'logAfterUmlsApi1')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# VIEW PREVIOUS ASSIGNMENTS IN GoldStandard_master\n", + "\n", + "\n", + "from matplotlib.pyplot import pie, axis, show\n", + "import numpy as np\n", + "import os\n", + "import string\n", + "\n", + "\n", + "# Bring in historical file of (somewhat edited) matches\n", + "GoldStandard = localDir + 'GoldStandard_Master.xlsx'\n", + "GoldStandard = pd.read_excel(GoldStandard)\n", + "\n", + "GoldStandard = GoldStandard[pd.notnull(GoldStandard['SemanticGroup'])]\n", + "\n", + "'''\n", + "SELECT * FROM `manual_assignments` \n", + "WHERE preferredTerm IS NULL\n", + "ORDER BY NewSemanticTypeName` DESC\n", + "\n", + "\n", + "preferredTerm, SemanticTypeName, SemanticGroup\n", + "'''\n", + "\n", + "df2 = GoldStandard[GoldStandard.preferredTerm.str.contains(\"photo\") == True]\n", + "df2 = GoldStandard[GoldStandard.SemanticTypeName.str.contains(\"foreign\") == True]\n", + "df2 = GoldStandard[GoldStandard.SemanticGroup.str.contains(\"foreign\") == True]\n", + "\n", + "\n", + "\n", + "\n", + "df = df.groupby('adjustedQueryCase').size()\n", + "df = pd.DataFrame({'timesSearched':df})\n", + "\n", + "\n", + "GoldStandard = GoldStandard.sort_values(by='timesSearched', ascending=False)\n", + "GoldStandard = GoldStandard.reset_index()\n", + "\n", + "\n", + "sum1 = logAfterUmlsApi1.groupby('SemanticTypeName').size()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# READ FROM SQL TO DATAFRAME\n", + "\n", + "\n", + "from sqlalchemy import create_engine\n", + "\n", + "dbconn = 
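A note on the BLAST fix-ups above: parentheses are regex metacharacters, so a pattern like `'^BLAST (physical force)'` is read as a capture group and never matches the literal text — the parens must be escaped, or `regex=False` passed. A small sketch on invented data:

```python
import pandas as pd

s = pd.Series(['BLAST (physical force)', 'BLAST Link'])

# Escaped parens match the literal parenthesized phrase
escaped = s.str.contains(r'^BLAST \(physical force\)', regex=True)

# regex=False treats the whole pattern as a literal substring
literal = s.str.contains('BLAST (physical force)', regex=False)
```

Both return `True` only for the first element; the unescaped version would match neither.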
create_engine('mysql+mysqlconnector://wendlingd:DataSciPwr17@localhost/ia')\n", + "\n", + "\n", + "# Extract from MySQL to df\n", + "mayJuneLog = pd.read_sql('SELECT * FROM timeboundhmpglog', con=dbconn)\n", + "\n", + "\n", + "\n", + "# Write this to file (assuming multiple cycles)\n", + "writer = pd.ExcelWriter(localDir + 'mayJuneLog.xlsx')\n", + "mayJuneLog.to_excel(writer,'timeboundhmpglog')\n", + "writer.save()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# UPDATE SEMANTIC NETWORK table IN MYSQL\n", + "'''\n", + "DROP TABLE IF EXISTS `semantic_network`;\n", + "CREATE TABLE `semantic_network` (\n", + " `semnet_id` INT PRIMARY KEY NOT NULL AUTO_INCREMENT,\n", + " `SemanticGroupCode` int(11) NOT NULL,\n", + " `SemanticGroup` varchar(60) NOT NULL,\n", + " `SemanticGroupAbr` varchar(10) NOT NULL,\n", + " `CustomTreeNumber` int(11) NOT NULL,\n", + " `SemanticTypeName` varchar(100) NOT NULL,\n", + " `BranchPosition` int(11) NOT NULL,\n", + " `Definition` varchar(200) NOT NULL,\n", + " `Examples` varchar(100) NOT NULL,\n", + " `RelationName` varchar(60) NOT NULL,\n", + " `SemTypeTreeNo` varchar(60) NOT NULL,\n", + " `UsageNote` varchar(60) NOT NULL,\n", + " `Abbreviation` varchar(60) NOT NULL,\n", + " `UniqueID` int(11) NOT NULL,\n", + " `NonHumanFlag` varchar(60) NOT NULL,\n", + " `RecordType` varchar(60) NOT NULL\n", + ") ENGINE=InnoDB DEFAULT CHARSET=utf8;\n", + "'''\n", + "\n", + "SemanticNetworkReference = pd.read_excel('01_Text_wrangling_files/SemanticNetworkReference.xlsx')\n", + "\n", + "SemanticNetworkReference.columns\n", + "\n", + "\n", + "# Add dataframe to MySQL\n", + "\n", + "import mysql.connector\n", + "from pandas.io import sql\n", + "from sqlalchemy import create_engine\n", + "\n", + "dbconn = create_engine('mysql+mysqlconnector://wendlingd:DataSciPwr17@localhost/ia')\n", + "\n", + "SemanticNetworkReference.to_sql(name='semantic_network', con=dbconn, if_exists = 'replace', 
index=False) # or if_exists='append'\n", + "\n", + "# Reduce to needed columns\n", + "listCol = SemanticNetworkReference[['SemanticGroupCode', 'SemanticGroup']]\n", + "\n", + "listCol = listCol.drop_duplicates('SemanticGroup')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# RE-NAME categories\n", + "\n", + "'''\n", + "SemanticGroup\n", + " Citation, PubMed strategy, complex, unclear, etc.\n", + "\n", + "logAfterFuzzyMatch\n", + "\n", + "'''\n", + "\n", + "\n", + "logAfterFuzzyMatch['preferredTerm'] = logAfterFuzzyMatch['preferredTerm'].str.replace('Bibliographic Entity', 'PubMed strategy, citation, unclear, etc.')\n", + "\n", + "logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.startswith('Bibliographic Entity', na=False), 'SemanticGroup'] = 'Unparsed'\n", + "logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.startswith('Bibliographic Entity', na=False), 'SemanticTypeName'] = 'Unparsed'\n", + "\n", + "\n", + "logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticGroup'] = 'Accession Number'\n", + "logAfterFuzzyMatch.loc[logAfterFuzzyMatch['preferredTerm'].str.contains('Numeric Entity', na=False), 'SemanticTypeName'] = 'Accession Number'\n", + "\n", + "\n", + "writer = pd.ExcelWriter('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n", + "logAfterFuzzyMatch.to_excel(writer,'logAfterFuzzyMatch')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n", + "\n", + "\n", + "\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^benefits of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^cause of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^cause for ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = 
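The `to_sql` round trip can be exercised without a MySQL server; a sketch using an in-memory SQLite connection in place of the engine above (table and column names borrowed from the cell, data invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL engine
df = pd.DataFrame({'SemanticGroupCode': [1, 2],
                   'SemanticGroup': ['Disorders', 'Genes & Molecular Sequences']})

# if_exists='replace' drops and recreates the table; 'append' adds rows to it
df.to_sql('semantic_network', con=conn, if_exists='replace', index=False)
df.to_sql('semantic_network', con=conn, if_exists='append', index=False)

roundtrip = pd.read_sql('SELECT * FROM semantic_network', con=conn)
```

After a `replace` followed by an `append`, `roundtrip` holds both copies of the rows, which makes the difference between the two modes easy to verify.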
logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^causes for ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^causes of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^definition for ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^definition of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^effect of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^etiology of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^symptoms of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^treating ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^treatment for ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^treatments for ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^treatment of ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^what are ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^what causes ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^what is a ', '')\n", + "logAfterFuzzyMatch['adjustedQueryCase'] = logAfterFuzzyMatch['adjustedQueryCase'].str.replace('^what is ', '')\n", + "\n", + "\n", + "writer = pd.ExcelWriter('03_Fuzzy_match_files/logAfterFuzzyMatch.xlsx')\n", + "logAfterFuzzyMatch.to_excel(writer,'logAfterFuzzyMatch')\n", + "# df2.to_excel(writer,'Sheet2')\n", + "writer.save()\n" + 
] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "logAfterFuzzyMatch = logAfterFuzzyMatch.replace(np.nan, 'Unparsed', regex=True)\n", + "\n", + "logAfterFuzzyMatch['preferredTerm'] = logAfterFuzzyMatch['preferredTerm'].str.replace('National Center for Biotechnology Information', 'NCBI')\n", + "\n", + "logAfterFuzzyMatch['preferredTerm'] = logAfterFuzzyMatch['preferredTerm'].str.replace('Formatted References for Authors of Journal Articles', 'Refs for J Article Authors')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
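The long run of one-prefix-per-line `str.replace` calls above could be collapsed into a single anchored alternation. A sketch on toy queries; the prefix list here is illustrative, not the full set used in the cell:

```python
import pandas as pd

# Regex fragments covering several of the question-style prefixes stripped above
prefixes = ['benefits of', 'causes? (?:of|for)', 'definition (?:of|for)',
            'treatments? (?:of|for)', 'what are', 'what causes', 'what is(?: a)?']
pattern = r'^(?:' + '|'.join(prefixes) + r') '

queries = pd.Series(['what is a biopsy', 'causes of anemia',
                     'treatment for gout', 'mesh'])
stripped = queries.str.replace(pattern, '', regex=True)
```

One pass yields `['biopsy', 'anemia', 'gout', 'mesh']`; adding a prefix then means one list entry instead of another full `str.replace` line.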