import glob
import warnings

import nbformat as nbf
import requests
from elasticsearch import Elasticsearch, RequestsHttpConnection

warnings.filterwarnings("ignore")
The problem? I have many Jupyter Notebooks, and I need an easy way to search through them all. I am always remembering that I have some snippet of code somewhere in these notebooks, so I need an easy way to find it.
Enter Elasticsearch
Elasticsearch "is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents" That means that you can put lots of documents full of text into it, and it will index all of this, and make it easy to search
The Elasticsearch folks have also built Kibana, which is "proprietary data visualization dashboard software for Elasticsearch". That means Kibana is a handy GUI tool you can use to quickly search your data, similar to using something like a Google search box.
In this notebook, I will demonstrate how to get Elasticsearch and Kibana up and running in Docker, extract the text from a folder of Jupyter Notebooks, push that text into Elasticsearch, and then search it from Python or from Kibana.
Prerequisites:
You know how to create a Jupyter Notebook and save it somewhere, and you have Docker installed.
Caveat:
This assumes your use case is that you want to easily search through your own notebooks as part of your workflow, and as such I am going to ignore some Elasticsearch security options, which is why I have an ignore-warnings filter in this notebook. Don't use this approach if you are planning something that is not dev.
This will all take about 5 minutes.
Set up and Hello World
Let's start with some setup. Open a Terminal on your machine (bash on Linux, PowerShell on Windows, whatever). Then run the following commands.
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.1
docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.1
So that's it for installing and getting Elasticsearch up and running. Now let's do the same for Kibana. Open a new shell and run the following commands:
docker pull docker.elastic.co/kibana/kibana:7.13.1
docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.13.1
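If you want to double check that Kibana itself is responding before opening a browser, it exposes a status endpoint you can hit from the notebook. A minimal sketch, using the same host.docker.internal address discussed below (swap in localhost if your Jupyter runs directly on your machine):

import requests

# Kibana's status endpoint; returns 200 once it has finished starting up
res = requests.get('http://host.docker.internal:5601/api/status')
print(res.status_code)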
So it looks like everything is working. Let's make sure our notebook can see Elasticsearch too:
res = requests.get('http://host.docker.internal:9200')
print(res.content)
b'{\n "name" : "829766e7847b",\n "cluster_name" : "docker-cluster",\n "cluster_uuid" : "mLWu9gbQQqqOy5xB3IONVg",\n "version" : {\n "number" : "7.13.1",\n "build_flavor" : "default",\n "build_type" : "docker",\n "build_hash" : "9a7758028e4ea59bcab41c12004603c5a7dd84a9",\n "build_date" : "2021-05-28T17:40:59.346932922Z",\n "build_snapshot" : false,\n "lucene_version" : "8.8.2",\n "minimum_wire_compatibility_version" : "6.8.0",\n "minimum_index_compatibility_version" : "6.0.0-beta1"\n },\n "tagline" : "You Know, for Search"\n}\n'
So this notebook can connect to the Elasticsearch instance. Note that I am using host.docker.internal in the URL in that get request. This is because I have set my Jupyter up in Docker as well (details at: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html). If you have installed an Anaconda instance or something, this URL would be localhost rather than host.docker.internal.
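If you are not sure which host applies to your setup, you can just probe both. This is a small convenience sketch (findElasticsearchHost is my own hypothetical helper, not part of any library):

def findElasticsearchHost(candidates=("host.docker.internal", "localhost")):
    # Hypothetical helper: return the first candidate host answering on port 9200
    for host in candidates:
        try:
            if requests.get(f"http://{host}:9200", timeout=2).ok:
                return host
        except requests.exceptions.ConnectionError:
            pass
    raise RuntimeError("Could not reach Elasticsearch on any candidate host")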
Now, using our Python elasticsearch library, let's create a connection to Elasticsearch:
# Note "host.docker.internal" might be "localhost" if you are running an Anaconda version of Jupyter
es = Elasticsearch(hosts=[{"host": "host.docker.internal", "port": 9200}],
                   connection_class=RequestsHttpConnection, max_retries=30,
                   retry_on_timeout=True, timeout=30)
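Before going further, it is worth a quick sanity check that the client can actually reach the cluster. es.ping() returns a boolean instead of raising, so it is handy for this:

# True if the cluster answered, False otherwise (no exception raised)
print(es.ping())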
Let's create an index (think of this as a db) and put some data into it:
# index some test data
es.index(index='testing-index', doc_type='test', id=1, body={'test': 'test'})
{'_index': 'testing-index', '_type': 'test', '_id': '1', '_version': 3, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 8, '_primary_term': 1}
We can go to http://localhost:9200/testing-index/_search?pretty=true&q=*:*
and see that the data now exists in Elasticsearch. Or we could just retrieve it using the Python Elasticsearch library:
res = es.get(index="testing-index", id=1)
res
{'_index': 'testing-index', '_type': '_doc', '_id': '1', '_version': 3, '_seq_no': 8, '_primary_term': 1, 'found': True, '_source': {'test': 'test'}}
So that works. Let's delete it now:
es.delete(index='testing-index', doc_type='test', id=1)
{'_index': 'testing-index', '_type': 'test', '_id': '1', '_version': 4, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 9, '_primary_term': 1}
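That removed the single document. If you would rather drop the whole testing index to keep the cluster tidy, the indices API can do that; passing ignore=[400, 404] makes it a no-op if the index is already gone:

# Drop the entire testing index; ignore the error if it does not exist
es.indices.delete(index='testing-index', ignore=[400, 404])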
Extracting Text from Jupyter Notebooks
Now let's do something a little more substantial. I have a folder full of Jupyter Notebooks, and I always need code from one or another of them. So let's create a function to extract all the text from the notebooks. First I need a list of the names of the Jupyter Notebooks from the directory in which they are located:
pathToLocationJupyterNotebookFiles = "../work/HTMNotebooks/"
jupyterNotebooksFileNames = glob.glob(pathToLocationJupyterNotebookFiles + './*.ipynb')
jupyterNotebooksFileNames
['../work/HTMNotebooks/./HTMTest.ipynb', '../work/HTMNotebooks/./HTM_Overview_0.ipynb', '../work/HTMNotebooks/./HTM_Overview_1.ipynb', '../work/HTMNotebooks/./HTM_Overview_10.ipynb', '../work/HTMNotebooks/./HTM_Overview_11.ipynb', '../work/HTMNotebooks/./HTM_Overview_2.ipynb', '../work/HTMNotebooks/./HTM_Overview_3.ipynb', '../work/HTMNotebooks/./HTM_Overview_4.ipynb', '../work/HTMNotebooks/./HTM_Overview_5.ipynb', '../work/HTMNotebooks/./HTM_Overview_6.ipynb', '../work/HTMNotebooks/./HTM_Overview_7.ipynb', '../work/HTMNotebooks/./HTM_Overview_8.ipynb', '../work/HTMNotebooks/./HTM_Overview_9.ipynb']
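My notebooks all sit in one flat folder. If yours are nested in subdirectories, glob can recurse with the ** pattern instead; a sketch:

# recursive=True lets ** match notebooks in nested subdirectories too
jupyterNotebooksFileNames = glob.glob(pathToLocationJupyterNotebookFiles + '**/*.ipynb', recursive=True)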
Now let's create a function to extract the text from a notebook, and then we will just iterate over each of the notebooks to extract the code:
NB_VERSION = 4

def extractTextFromNotebook(notebook_path):
    # Read a notebook and collect the source text of every code and markdown cell
    formatted = nbf.read(notebook_path, as_version=NB_VERSION)
    text = []
    for cell in formatted.get('cells', []):
        if 'source' in cell and cell.get('cell_type') in ('code', 'markdown'):
            text.append(cell['source'])
    return text
textFromNotebooks = [extractTextFromNotebook(fileName) for fileName in jupyterNotebooksFileNames]
We get back a list of lists, one per notebook. Let's check we are on the right track by looking at, say, item 5 of the second notebook:
textFromNotebooks[1][5]
'from htm.bindings.sdr import SDR, Metrics\nfrom htm.encoders.rdse import RDSE, RDSE_Parameters\nfrom htm.encoders.date import DateEncoder\nfrom htm.bindings.algorithms import SpatialPooler\nfrom htm.bindings.algorithms import TemporalMemory\nfrom htm.algorithms.anomaly_likelihood import AnomalyLikelihood \nfrom htm.bindings.algorithms import Predictor'
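If you also wanted each search hit to tell you whether it came from a code or a markdown cell, a small variant of the function could keep that metadata. A sketch of the idea, not what I used above:

def extractCellsWithTypeFromNotebook(notebook_path):
    # Hypothetical variant: keep the cell type and position alongside the text
    formatted = nbf.read(notebook_path, as_version=NB_VERSION)
    cells = []
    for i, cell in enumerate(formatted.get('cells', [])):
        if 'source' in cell and cell.get('cell_type') in ('code', 'markdown'):
            cells.append({'cellIndex': i,
                          'cellType': cell['cell_type'],
                          'text': cell['source']})
    return cells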
Now let's iterate through all those notebooks converted to text and push them into Elasticsearch. Elasticsearch wants something JSON-like, so that is what we will give it. Note that the indexing function below returns nothing, so the list comprehension that calls it just prints a list of None values, but the indexing still works:
elasticDBName = "j-notebook-cell-search-index"

def writeTextCellsToElasticSearchDB(doc, notebookFilePath):
    # Index each cell's text as its own document, tagged with its source notebook
    for cellText in doc:
        cellDict = {'text': [cellText],  # stored as a one-element list in _source
                    'noteBookFilePath': notebookFilePath}
        es.index(index=elasticDBName, doc_type='cell', body=cellDict)
[writeTextCellsToElasticSearchDB(text, path) for text, path in zip(textFromNotebooks, jupyterNotebooksFileNames)]
[None, None, None, None, None, None, None, None, None, None, None, None, None]
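Indexing cell by cell like this is fine at my scale. For a lot of notebooks, the Python client also ships a bulk helper that sends documents in batches; a sketch of the same indexing done that way (equivalent result, but not the code I ran above, and buildBulkActions is my own hypothetical helper):

from elasticsearch.helpers import bulk

def buildBulkActions(notebookTexts, notebookFilePaths):
    # Hypothetical helper: yield one action dict per cell for the bulk API
    for doc, path in zip(notebookTexts, notebookFilePaths):
        for cellText in doc:
            yield {'_index': elasticDBName,
                   '_source': {'text': [cellText], 'noteBookFilePath': path}}

bulk(es, buildBulkActions(textFromNotebooks, jupyterNotebooksFileNames))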
Searching Notebooks
So now all the data is in Elasticsearch, and we want to search it. There are three ways to do this. The quickest is simply to open the search endpoint in your browser:
http://localhost:9200/j-notebook-cell-search-index/_search?pretty=true&q=*:*
The other two options are covered below.
Option 1: Using Python
This is my preferred way of doing it. Here are some handy getting-started searches you can run against the data you have put into Elasticsearch:
# Grab a particular record now that the index is up and running - note I got this ID
# from http://localhost:9200/j-notebook-cell-search-index/_search?pretty=true&q=*:*
es.get(index=elasticDBName, doc_type="_doc", id="VFtpKHoB3T1ThL6Sx1Yg")
{'_index': 'j-notebook-cell-search-index', '_type': '_doc', '_id': 'VFtpKHoB3T1ThL6Sx1Yg', '_version': 1, '_seq_no': 0, '_primary_term': 1, 'found': True, '_source': {'text': ['import csv\nimport datetime\nimport os\nimport numpy as np\nimport random\nimport math\n\nfrom htm.bindings.sdr import SDR, Metrics\nfrom htm.encoders.rdse import RDSE, RDSE_Parameters\nfrom htm.encoders.date import DateEncoder\nfrom htm.bindings.algorithms import SpatialPooler\nfrom htm.bindings.algorithms import TemporalMemory\nfrom htm.algorithms.anomaly_likelihood import AnomalyLikelihood #FIXME use TM.anomaly instead, but it gives worse results than the py.AnomalyLikelihood now\nfrom htm.bindings.algorithms import Predictor'], 'noteBookFilePath': '../work/HTMNotebooks/./HTMTest.ipynb'}}
It supports all kinds of queries: text match, partial match, etc. Here is another example. The use case here is that I know I have a notebook where I did some work on Baltimore Crime Data, but I can't remember where. So I will put in the prefix "crim" and let Elasticsearch do its thing.
Note that things can get a bit messy, so I would advise you to keep your query in a separate Python dictionary, and then just pass that into the search:
q = {
    "query": {
        "prefix": {
            "text": "crim"
        }
    }
}

es.search(index=elasticDBName, body=q)
{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'j-notebook-cell-search-index', '_type': 'cell', '_id': 'OltpKHoB3T1ThL6Szlec', '_score': 1.0, '_source': {'text': ['To explore this, let\'s use some data. There is some really interesting data that will turn up in episode\'s 7 and 8 in the context of the Spatial Pooler, that has some interesting info, but for now, let\'s use Baltimore Crime Data. This data has nice coverage across a number of data points, descriptive names, some categorical variables, the footprint isn\'t too big but it gives us a nice sample of 96k records\n\nInformation available <a href="https://data.baltimorecity.gov/datasets/baltimore::part1-crime-2015-to-2016/about">https://data.baltimorecity.gov/datasets/baltimore::part1-crime-2015-to-2016/about</a>\n'], 'noteBookFilePath': '../work/HTMNotebooks/./HTM_Overview_5.ipynb'}}, {'_index': 'j-notebook-cell-search-index', '_type': 'cell', '_id': 'PVtpKHoB3T1ThL6Szlez', '_score': 1.0, '_source': {'text': ['df = pd.read_csv("./data/Part1_Crime_2015_to__2016.csv")\ndf.CrimeDateTime = df.CrimeDateTime.str.slice(0, -8)\ndf.CrimeDateTime= pd.to_datetime(df.CrimeDateTime)\ndf[\'weekdayCodeWhenEventReported\'] = [d.weekday() for d in df.CrimeDateTime]\ndf[\'monthCodeWhenEventReported\'] = df[\'CrimeDateTime\'].dt.month\ndf[\'seasonCodeWhenEventReported\'] = (df[\'CrimeDateTime\'].dt.month - 1) % 4\ndf[\'isWeekend\'] = np.where(df.weekdayCodeWhenEventReported > 4, True, False)\ndf = df.drop(\'VRIName\', axis=1)\ndf = df.drop(\'HashedRecord\', axis=1)\ndf = df.drop(\'ObjectId\', axis=1)'], 'noteBookFilePath': '../work/HTMNotebooks/./HTM_Overview_5.ipynb'}}, {'_index': 'j-notebook-cell-search-index', '_type': 'cell', '_id': '0ltpKHoB3T1ThL6S01cp', '_score': 1.0, '_source': {'text': ["<h2>HTM Overview 9: Boosting</h2>\n\nSo now let's start working with. The first thing we want to do is create a Scalar Encoder\n\nCrime data"], 'noteBookFilePath': '../work/HTMNotebooks/./HTM_Overview_9.ipynb'}}]}}
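Prefix queries are only one option. A full-text match query with highlighting is often more useful, since Elasticsearch scores results by relevance and marks up the matching fragments; a sketch against the same index:

q = {
    "query": {
        "match": {"text": "baltimore crime"}
    },
    "highlight": {
        "fields": {"text": {}}  # matching fragments come back wrapped in <em> tags
    }
}
res = es.search(index=elasticDBName, body=q)
for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['noteBookFilePath'])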
Option 2: Using Kibana
Kibana is cool, but after a while I did get annoyed at the UI. But if you are going to use it:
Go to http://localhost:5601/app/home#/ which should be up and running.
Go to Stack Management at http://localhost:5601/app/management and then to Index Patterns at http://localhost:5601/app/management/kibana/indexPatterns
Create an index pattern matching our index at http://localhost:5601/app/management/kibana/indexPatterns/create
Go back to http://localhost:5601/app/home#/
and choose "Discover" from the left hand index. From there you will have a search box and some filters, and all kinds of cool things you can check.
Enjoy!