Simple-Jekyll-Search

Build Status dependencies Status devDependencies Status

A JavaScript library to add search functionality to any Jekyll blog.

Use case

You have a blog, built with Jekyll, and want a lightweight search functionality on your blog, purely client-side?

No server configurations or databases to maintain.

Just 5 minutes to have a fully working searchable blog.


Installation

npm

npm install simple-jekyll-search

Getting started

Create search.json

Place the following code in a file called search.json in the root of your Jekyll blog. (You can also get a copy from here)

This file will be used as a small data source to perform the searches on the client side:

---
layout: none
---
[
  
]

Preparing the plugin

Add DOM elements

SimpleJekyllSearch needs two DOM elements to work:

Give me the code

Here is the code you can use with the default configuration:

You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)

For example in _layouts/default.html:

<!-- HTML elements for search -->
<input type="text" id="search-input" placeholder="Search blog posts..">
<ul id="results-container"></ul>

<!-- or without installing anything -->
<script src="https://unpkg.com/simple-jekyll-search@latest/dest/simple-jekyll-search.min.js"></script>

Usage

Customize SimpleJekyllSearch by passing in your configuration options:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json'
})

returns { search }

A new instance of SimpleJekyllSearch returns an object, with the only property search.

search is a function used to simulate a user input and display the matching results. 

E.g.:

var sjs = SimpleJekyllSearch({ ...options })
sjs.search('Hello')

💡 it can be used to filter posts by tags or categories!

Options

Here is a list of the available options, usage questions, troubleshooting & guides.

searchInput (Element) [required]

The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.

resultsContainer (Element) [required]

The container element in which the search results should be rendered in. Typically a <ul>.

json (String|JSON) [required]

You can either pass in an URL to the search.json file, or the results in form of JSON directly, to save one round trip to get the data.

searchResultTemplate (String) [optional]

The template of a single rendered search result.

The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.

E.g.

The template

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  searchResultTemplate: '<li><a href="https://biologicalmodeling.org{url}">{title}</a></li>'
})

will render to the following

<li><a href="/jekyll/update/2014/11/01/welcome-to-jekyll.html">Welcome to Jekyll!</a></li>

If the search.json contains this data

[
    {
      "title"    : "Welcome to Jekyll!",
      "category" : "",
      "tags"     : "",
      "url"      : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",
      "date"     : "2014-11-01 21:07:22 +0100"
    }
]

templateMiddleware (Function) [optional]

A function that will be called whenever a match in the template is found.

It gets passed the current property name, property value, and the template.

If the function returns a non-undefined value, it gets replaced in the template.

This can be potentially useful for manipulating URLs etc.

Example:

SimpleJekyllSearch({
  ...
  templateMiddleware: function(prop, value, template) {
    if (prop === 'bar') {
      return value.replace(/^\//, '')
    }
  }
  ...
})

See the tests for an in-depth code example

sortMiddleware (Function) [optional]

A function that will be used to sort the filtered results.

It can be used for example to group the sections together.

Example:

SimpleJekyllSearch({
  ...
  sortMiddleware: function(a, b) {
    var astr = String(a.section) + "-" + String(a.caption);
    var bstr = String(b.section) + "-" + String(b.caption);
    return astr.localeCompare(bstr)
  }
  ...
})

noResultsText (String) [optional]

The HTML that will be shown if the query didn’t match anything.

limit (Number) [optional]

You can limit the number of posts rendered on the page.

fuzzy (Boolean) [optional]

Enable fuzzy search to allow less restrictive matching.

exclude (Array) [optional]

Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).

success (Function) [optional]

A function called once the data has been loaded.

debounceTime (Number) [optional]

Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no debounceTime (milliseconds) is provided a search will be triggered on each keystroke.


If search isn’t working due to invalid JSON

For example: in search.json, replace

"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)[![Build Status](https://img.shields.io/travis/christian-fei/Simple-Jekyll-Search/master.svg?)](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)[![dependencies Status](https://img.shields.io/david/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search)[![devDependencies Status](https://img.shields.io/david/dev/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)A JavaScript library to add search functionality to any Jekyll blog.## Use caseYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?*No server configurations or databases to maintain*.Just **5 minutes** to have a **fully working searchable blog**.---## Installation### npm```shnpm install simple-jekyll-search```## Getting started### Create `search.json`Place the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))This file will be used as a small data source to perform the searches on the client side:```yaml---layout: none---[  {% for post in site.posts %}    {      "title"    : "{{ post.title | escape }}",      "category" : "{{ post.category }}",      "tags"     : "{{ post.tags | join: ', ' }}",      "url"      : "{{ site.baseurl }}{{ post.url }}",      "date"     : "{{ post.date }}"    } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Preparing the plugin### Add DOM elementsSimpleJekyllSearch needs two `DOM` elements to work:- a search input field- a result container to display the results#### Give me the codeHere is the code you can use with the default configuration:You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)For example in  **_layouts/default.html**:```html```## UsageCustomize SimpleJekyllSearch by passing in your configuration options:```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById('search-input'),  resultsContainer: document.getElementById('results-container'),  json: '/search.json'})```### returns { search }A new instance of SimpleJekyllSearch returns an object, with the only property `search`.`search` is a function used to simulate a user input and display the matching results. E.g.:```jsvar sjs = SimpleJekyllSearch({ ...options })sjs.search('Hello')```💡 it can be used to filter posts by tags or categories!## OptionsHere is a list of the available options, usage questions, troubleshooting & guides.### searchInput (Element) [required]The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.### resultsContainer (Element) [required]The container element in which the search results should be rendered in. Typically a ``.### json (String|JSON) [required]You can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.### searchResultTemplate (String) [optional]The template of a single rendered search result.The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.E.g.The template```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById('search-input'),  resultsContainer: document.getElementById('results-container'),  json: '/search.json',  searchResultTemplate: '{title}'})```will render to the following```htmlWelcome to Jekyll!```If the `search.json` contains this data```json[    {      "title"    : "Welcome to Jekyll!",      "category" : "",      "tags"     : "",      "url"      : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",      "date"     : "2014-11-01 21:07:22 +0100"    }]```### templateMiddleware (Function) [optional]A function that will be called whenever a match in the template is found.It gets passed the current property name, property value, and the template.If the function returns a non-undefined value, it gets replaced in the template.This can be potentially useful for manipulating URLs etc.Example:```jsSimpleJekyllSearch({  ...  templateMiddleware: function(prop, value, template) {    if (prop === 'bar') {      return value.replace(/^\//, '')    }  }  ...})```See the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example### sortMiddleware (Function) [optional]A function that will be used to sort the filtered results.It can be used for example to group the sections together.Example:```jsSimpleJekyllSearch({  ...  sortMiddleware: function(a, b) {    var astr = String(a.section) + "-" + String(a.caption);    var bstr = String(b.section) + "-" + String(b.caption);    return astr.localeCompare(bstr)  }  ...})```### noResultsText (String) [optional]The HTML that will be shown if the query didn't match anything.### limit (Number) [optional]You can limit the number of posts rendered on the page.### fuzzy (Boolean) [optional]Enable fuzzy search to allow less restrictive matching.### exclude (Array) [optional]Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).### success (Function) [optional]A function called once the data has been loaded.### debounceTime (Number) [optional]Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.---## If search isn't working due to invalid JSON- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.For example: in search.json, replace```json"content": "{{ page.content | strip_html | strip_newlines }}"```with```json"content": "{{ page.content | strip_html | strip_newlines | remove_chars | escape }}"```If this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:```js"content": {{ page.content | jsonify }}```**Note: you don't need to use quotes `"` in this since `jsonify` automatically inserts them.**## Enabling full-text searchReplace `search.json` with the following code:```yaml---layout: none---[  {% for post in site.posts %}    {      "title"    : "{{ post.title | escape }}",      "category" : "{{ post.category }}",      "tags"     : "{{ post.tags | join: ', ' }}",      "url"      : "{{ site.baseurl }}{{ post.url }}",      "date"     : "{{ post.date }}",      "content"  : "{{ post.content | strip_html | strip_newlines }}"    } {% unless forloop.last %},{% endunless %}  {% endfor %}  ,  {% for page in site.pages %}   {     {% if page.title != nil %}        "title"    : "{{ page.title | escape }}",        "category" : "{{ page.category }}",        "tags"     : "{{ page.tags | join: ', ' }}",        "url"      : "{{ site.baseurl }}{{ page.url }}",        "date"     : "{{ page.date }}",        "content"  : "{{ page.content | strip_html | strip_newlines }}"     {% endif %}   } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Development- `npm install`- `npm test`#### Acceptance tests```bashcd example; jekyll serve# in another tabnpm run cypress -- run```## ContributorsThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)> [@daviddarnes](https://github.com/daviddarnes)[@XhmikosR](https://github.com/XhmikosR)[@PeterDaveHello](https://github.com/PeterDaveHello)[@mikeybeck](https://github.com/mikeybeck)[@egladman](https://github.com/egladman)[@midzer](https://github.com/midzer)[@eduardoboucas](https://github.com/eduardoboucas)[@kremalicious](https://github.com/kremalicious)[@tibotiber](https://github.com/tibotiber)and many others!## Stargazers over time[![Stargazers over time](https://starchart.cc/christian-fei/Simple-Jekyll-Search.svg)](https://starchart.cc/christian-fei/Simple-Jekyll-Search)"

with

"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)[![Build Status](https://img.shields.io/travis/christian-fei/Simple-Jekyll-Search/master.svg?)](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)[![dependencies Status](https://img.shields.io/david/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search)[![devDependencies Status](https://img.shields.io/david/dev/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)A JavaScript library to add search functionality to any Jekyll blog.## Use caseYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?*No server configurations or databases to maintain*.Just **5 minutes** to have a **fully working searchable blog**.---## Installation### npm```shnpm install simple-jekyll-search```## Getting started### Create `search.json`Place the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))This file will be used as a small data source to perform the searches on the client side:```yaml---layout: none---[  {% for post in site.posts %}    {      &quot;title&quot;    : &quot;{{ post.title | escape }}&quot;,      &quot;category&quot; : &quot;{{ post.category }}&quot;,      &quot;tags&quot;     : &quot;{{ post.tags | join: &#39;, &#39; }}&quot;,      &quot;url&quot;      : &quot;{{ site.baseurl }}{{ post.url }}&quot;,      &quot;date&quot;     : &quot;{{ post.date }}&quot;    } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Preparing the plugin### Add DOM elementsSimpleJekyllSearch needs two `DOM` elements to work:- a search input field- a result container to display the results#### Give me the codeHere is the code you can use with the default configuration:You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)For example in  **_layouts/default.html**:```html```## UsageCustomize SimpleJekyllSearch by passing in your configuration options:```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById(&#39;search-input&#39;),  resultsContainer: document.getElementById(&#39;results-container&#39;),  json: &#39;/search.json&#39;})```### returns { search }A new instance of SimpleJekyllSearch returns an object, with the only property `search`.`search` is a function used to simulate a user input and display the matching results. E.g.:```jsvar sjs = SimpleJekyllSearch({ ...options })sjs.search(&#39;Hello&#39;)```💡 it can be used to filter posts by tags or categories!## OptionsHere is a list of the available options, usage questions, troubleshooting &amp; guides.### searchInput (Element) [required]The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.### resultsContainer (Element) [required]The container element in which the search results should be rendered in. Typically a ``.### json (String|JSON) [required]You can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.### searchResultTemplate (String) [optional]The template of a single rendered search result.The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.E.g.The template```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById(&#39;search-input&#39;),  resultsContainer: document.getElementById(&#39;results-container&#39;),  json: &#39;/search.json&#39;,  searchResultTemplate: &#39;{title}&#39;})```will render to the following```htmlWelcome to Jekyll!```If the `search.json` contains this data```json[    {      &quot;title&quot;    : &quot;Welcome to Jekyll!&quot;,      &quot;category&quot; : &quot;&quot;,      &quot;tags&quot;     : &quot;&quot;,      &quot;url&quot;      : &quot;/jekyll/update/2014/11/01/welcome-to-jekyll.html&quot;,      &quot;date&quot;     : &quot;2014-11-01 21:07:22 +0100&quot;    }]```### templateMiddleware (Function) [optional]A function that will be called whenever a match in the template is found.It gets passed the current property name, property value, and the template.If the function returns a non-undefined value, it gets replaced in the template.This can be potentially useful for manipulating URLs etc.Example:```jsSimpleJekyllSearch({  ...  templateMiddleware: function(prop, value, template) {    if (prop === &#39;bar&#39;) {      return value.replace(/^\//, &#39;&#39;)    }  }  ...})```See the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example### sortMiddleware (Function) [optional]A function that will be used to sort the filtered results.It can be used for example to group the sections together.Example:```jsSimpleJekyllSearch({  ...  sortMiddleware: function(a, b) {    var astr = String(a.section) + &quot;-&quot; + String(a.caption);    var bstr = String(b.section) + &quot;-&quot; + String(b.caption);    return astr.localeCompare(bstr)  }  ...})```### noResultsText (String) [optional]The HTML that will be shown if the query didn&#39;t match anything.### limit (Number) [optional]You can limit the number of posts rendered on the page.### fuzzy (Boolean) [optional]Enable fuzzy search to allow less restrictive matching.### exclude (Array) [optional]Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).### success (Function) [optional]A function called once the data has been loaded.### debounceTime (Number) [optional]Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.---## If search isn&#39;t working due to invalid JSON- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.For example: in search.json, replace```json&quot;content&quot;: &quot;{{ page.content | strip_html | strip_newlines }}&quot;```with```json&quot;content&quot;: &quot;{{ page.content | strip_html | strip_newlines | remove_chars | escape }}&quot;```If this doesn&#39;t work when using Github pages you can try `jsonify` to make sure the content is json compatible:```js&quot;content&quot;: {{ page.content | jsonify }}```**Note: you don&#39;t need to use quotes `&quot;` in this since `jsonify` automatically inserts them.**## Enabling full-text searchReplace `search.json` with the following code:```yaml---layout: none---[  {% for post in site.posts %}    {      &quot;title&quot;    : &quot;{{ post.title | escape }}&quot;,      &quot;category&quot; : &quot;{{ post.category }}&quot;,      &quot;tags&quot;     : &quot;{{ post.tags | join: &#39;, &#39; }}&quot;,      &quot;url&quot;      : &quot;{{ site.baseurl }}{{ post.url }}&quot;,      &quot;date&quot;     : &quot;{{ post.date }}&quot;,      &quot;content&quot;  : &quot;{{ post.content | strip_html | strip_newlines }}&quot;    } {% unless forloop.last %},{% endunless %}  {% endfor %}  ,  {% for page in site.pages %}   {     {% if page.title != nil %}        &quot;title&quot;    : &quot;{{ page.title | escape }}&quot;,        &quot;category&quot; : &quot;{{ page.category }}&quot;,        &quot;tags&quot;     : &quot;{{ page.tags | join: &#39;, &#39; }}&quot;,        &quot;url&quot;      : &quot;{{ site.baseurl }}{{ page.url }}&quot;,        &quot;date&quot;     : &quot;{{ page.date }}&quot;,        &quot;content&quot;  : &quot;{{ page.content | strip_html | strip_newlines }}&quot;     {% endif %}   } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Development- `npm install`- `npm test`#### Acceptance tests```bashcd example; jekyll serve# in another tabnpm run cypress -- run```## ContributorsThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)&gt; [@daviddarnes](https://github.com/daviddarnes)[@XhmikosR](https://github.com/XhmikosR)[@PeterDaveHello](https://github.com/PeterDaveHello)[@mikeybeck](https://github.com/mikeybeck)[@egladman](https://github.com/egladman)[@midzer](https://github.com/midzer)[@eduardoboucas](https://github.com/eduardoboucas)[@kremalicious](https://github.com/kremalicious)[@tibotiber](https://github.com/tibotiber)and many others!## Stargazers over time[![Stargazers over time](https://starchart.cc/christian-fei/Simple-Jekyll-Search.svg)](https://starchart.cc/christian-fei/Simple-Jekyll-Search)"

If this doesn’t work when using Github pages you can try jsonify to make sure the content is json compatible:

"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)\n\n[![Build Status](https://img.shields.io/travis/christian-fei/Simple-Jekyll-Search/master.svg?)](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)\n[![dependencies Status](https://img.shields.io/david/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search)\n[![devDependencies Status](https://img.shields.io/david/dev/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)\n\nA JavaScript library to add search functionality to any Jekyll blog.\n\n## Use case\n\nYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?\n\n*No server configurations or databases to maintain*.\n\nJust **5 minutes** to have a **fully working searchable blog**.\n\n---\n\n## Installation\n\n### npm\n\n```sh\nnpm install simple-jekyll-search\n```\n\n## Getting started\n\n### Create `search.json`\n\nPlace the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))\n\nThis file will be used as a small data source to perform the searches on the client side:\n\n```yaml\n---\nlayout: none\n---\n[\n  {% for post in site.posts %}\n    {\n      \"title\"    : \"{{ post.title | escape }}\",\n      \"category\" : \"{{ post.category }}\",\n      \"tags\"     : \"{{ post.tags | join: ', ' }}\",\n      \"url\"      : \"{{ site.baseurl }}{{ post.url }}\",\n      \"date\"     : \"{{ post.date }}\"\n    } {% unless forloop.last %},{% endunless %}\n  {% endfor %}\n]\n```\n\n\n## Preparing the plugin\n\n### Add DOM elements\n\nSimpleJekyllSearch needs two `DOM` elements to work:\n\n- a search input field\n- a result container to display the results\n\n#### Give me the code\n\nHere is the code you can use with the default configuration:\n\nYou need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)\n\nFor example in  **_layouts/default.html**:\n\n```html\n<!-- HTML elements for search -->\n<input type=\"text\" id=\"search-input\" placeholder=\"Search blog posts..\">\n<ul id=\"results-container\"></ul>\n\n<!-- or without installing anything -->\n<script src=\"https://unpkg.com/simple-jekyll-search@latest/dest/simple-jekyll-search.min.js\"></script>\n```\n\n\n## Usage\n\nCustomize SimpleJekyllSearch by passing in your configuration options:\n\n```js\nvar sjs = SimpleJekyllSearch({\n  searchInput: document.getElementById('search-input'),\n  resultsContainer: document.getElementById('results-container'),\n  json: '/search.json'\n})\n```\n\n### returns { search }\n\nA new instance of SimpleJekyllSearch returns an object, with the only property `search`.\n\n`search` is a function used to simulate a user input and display the matching results. \n\nE.g.:\n\n```js\nvar sjs = SimpleJekyllSearch({ ...options })\nsjs.search('Hello')\n```\n\n💡 it can be used to filter posts by tags or categories!\n\n## Options\n\nHere is a list of the available options, usage questions, troubleshooting & guides.\n\n### searchInput (Element) [required]\n\nThe input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.\n\n\n### resultsContainer (Element) [required]\n\nThe container element in which the search results should be rendered in. Typically a `<ul>`.\n\n\n### json (String|JSON) [required]\n\nYou can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.\n\n\n### searchResultTemplate (String) [optional]\n\nThe template of a single rendered search result.\n\nThe templating syntax is very simple: You just enclose the properties you want to replace with curly braces.\n\nE.g.\n\nThe template\n\n```js\nvar sjs = SimpleJekyllSearch({\n  searchInput: document.getElementById('search-input'),\n  resultsContainer: document.getElementById('results-container'),\n  json: '/search.json',\n  searchResultTemplate: '<li><a href=\"{{ site.url }}{url}\">{title}</a></li>'\n})\n```\n\nwill render to the following\n\n```html\n<li><a href=\"/jekyll/update/2014/11/01/welcome-to-jekyll.html\">Welcome to Jekyll!</a></li>\n```\n\nIf the `search.json` contains this data\n\n```json\n[\n    {\n      \"title\"    : \"Welcome to Jekyll!\",\n      \"category\" : \"\",\n      \"tags\"     : \"\",\n      \"url\"      : \"/jekyll/update/2014/11/01/welcome-to-jekyll.html\",\n      \"date\"     : \"2014-11-01 21:07:22 +0100\"\n    }\n]\n```\n\n\n### templateMiddleware (Function) [optional]\n\nA function that will be called whenever a match in the template is found.\n\nIt gets passed the current property name, property value, and the template.\n\nIf the function returns a non-undefined value, it gets replaced in the template.\n\nThis can be potentially useful for manipulating URLs etc.\n\nExample:\n\n```js\nSimpleJekyllSearch({\n  ...\n  templateMiddleware: function(prop, value, template) {\n    if (prop === 'bar') {\n      return value.replace(/^\\//, '')\n    }\n  }\n  ...\n})\n```\n\nSee the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example\n\n### sortMiddleware (Function) [optional]\n\nA function that will be used to sort the filtered results.\n\nIt can be used for example to group the sections together.\n\nExample:\n\n```js\nSimpleJekyllSearch({\n  ...\n  sortMiddleware: function(a, b) {\n    var astr = String(a.section) + \"-\" + String(a.caption);\n    var bstr = String(b.section) + \"-\" + String(b.caption);\n    return astr.localeCompare(bstr)\n  }\n  ...\n})\n```\n\n### noResultsText (String) [optional]\n\nThe HTML that will be shown if the query didn't match anything.\n\n\n### limit (Number) [optional]\n\nYou can limit the number of posts rendered on the page.\n\n\n### fuzzy (Boolean) [optional]\n\nEnable fuzzy search to allow less restrictive matching.\n\n### exclude (Array) [optional]\n\nPass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).\n\n### success (Function) [optional]\n\nA function called once the data has been loaded.\n\n### debounceTime (Number) [optional]\n\nLimit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.\n\n---\n\n## If search isn't working due to invalid JSON\n\n- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.\n\nFor example: in search.json, replace\n\n```json\n\"content\": \"{{ page.content | strip_html | strip_newlines }}\"\n```\n\nwith\n\n```json\n\"content\": \"{{ page.content | strip_html | strip_newlines | remove_chars | escape }}\"\n```\n\nIf this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:\n\n```js\n\"content\": {{ page.content | jsonify }}\n```\n\n**Note: you don't need to use quotes `\"` in this since `jsonify` automatically inserts them.**\n\n\n## Enabling full-text search\n\nReplace `search.json` with the following code:\n\n```yaml\n---\nlayout: none\n---\n[\n  {% for post in site.posts %}\n    {\n      \"title\"    : \"{{ post.title | escape }}\",\n      \"category\" : \"{{ post.category }}\",\n      \"tags\"     : \"{{ post.tags | join: ', ' }}\",\n      \"url\"      : \"{{ site.baseurl }}{{ post.url }}\",\n      \"date\"     : \"{{ post.date }}\",\n      \"content\"  : \"{{ post.content | strip_html | strip_newlines }}\"\n    } {% unless forloop.last %},{% endunless %}\n  {% endfor %}\n  ,\n  {% for page in site.pages %}\n   {\n     {% if page.title != nil %}\n        \"title\"    : \"{{ page.title | escape }}\",\n        \"category\" : \"{{ page.category }}\",\n        \"tags\"     : \"{{ page.tags | join: ', ' }}\",\n        \"url\"      : \"{{ site.baseurl }}{{ page.url }}\",\n        \"date\"     : \"{{ page.date }}\",\n        \"content\"  : \"{{ page.content | strip_html | strip_newlines }}\"\n     {% endif %}\n   } {% unless forloop.last %},{% endunless %}\n  {% endfor %}\n]\n```\n\n\n\n## Development\n\n- `npm install`\n- `npm test`\n\n#### Acceptance tests\n\n```bash\ncd example; jekyll serve\n\n# in another tab\n\nnpm run cypress -- run\n```\n\n## Contributors\n\nThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)\n\n> [@daviddarnes](https://github.com/daviddarnes)\n[@XhmikosR](https://github.com/XhmikosR)\n[@PeterDaveHello](https://github.com/PeterDaveHello)\n[@mikeybeck](https://github.com/mikeybeck)\n[@egladman](https://github.com/egladman)\n[@midzer](https://github.com/midzer)\n[@eduardoboucas](https://github.com/eduardoboucas)\n[@kremalicious](https://github.com/kremalicious)\n[@tibotiber](https://github.com/tibotiber)\nand many others!\n\n## Stargazers over time\n\n[![Stargazers over time](https://starchart.cc/christian-fei/Simple-Jekyll-Search.svg)](https://starchart.cc/christian-fei/Simple-Jekyll-Search)\n"

Note: you don’t need to use quotes " in this since jsonify automatically inserts them.

Replace search.json with the following code:

---
layout: none
---
[
  
  ,
  
   {
     
        "title"    : "Page Not Found",
        "category" : "",
        "tags"     : "",
        "url"      : "/404.html",
        "date"     : "",
        "content"  : "Sorry, but the page you were trying to view does not exist  please return to our course homepage."
     
   } ,
  
   {
     
        "title"    : "VMD Tutorial",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/VMDTutorial",
        "date"     : "",
        "content"  : "This is a short tutorial on how to use VMD to visualize molecules and perform some basic analysis. Before you start, make sure to have downloaded and installed VMD.Loading MoleculesThese steps will be on how to load molecules into VMD. We will use the example of 6vw1.Download the protein structure of 6vw1 from the protein data bank.Next we can launch VMD and load the molecule into the program. In VMD Main, navigate to File &gt; New Molecule. Click Browse, select the molecule (6vw1.pdb) and click Load.The molecule should now be listed in VMD Main as well as the visualization in the OpenGL Display.Section to be movedGlycansFor VMD, there is no specific keyword to select glycans. A workaround is to use the keywords: “not protein and not water”. To recreate the basic VMD visualizations from the module of the open-state (6vyb) of SARS-CoV-2 Spike, use the following representations. (For the protein chains, use Glass3 for Material).The end result should look like this: Visualization Exercise Try to recreate the visualization of Hotspot31 for SARS-CoV-2 (same molecule as the tutorial). The important residues and their corresponding colors are listed on the left. "
     
   } ,
  
   {
     
        "title"    : "Ab initio Protein Structure Prediction",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/ab_initio",
        "date"     : "",
        "content"  : "Modeling ab initio structure prediction as an exploration problemPredicting a protein’s structure using only its amino acid sequence is called ab initio structure prediction (ab initio means “from the beginning” in Latin). Although many different algorithms have been developed for ab initio protein structure through the years, these algorithms all find themselves solving a similar problem.Biochemical research has contributed to the development of scoring functions called force fields that use the physicochemical properties of amino acids introduced in the previous lesson to compute the potential energy of a candidate protein shape. For a given choice of force field, we can think of ab initio structure prediction as solving the following problem: given a primary structure of a polypeptide, find its tertiary structure having minimum energy. This problem exemplifies an optimization problem, in which we are seeking an object maximizing or minimizing some function subject to constraints.This formulation of protein structure prediction may not strike you as similar to anything that we have done before in this course. However, consider once more a bacterium exploring an environment for food. Every point in the bacterium’s “search space” is characterized by a concentration of attractant, and the bacterium’s goal is to reach the point of maximum attractant concentration.In the case of structure prediction, our search space is the collection of all possible conformations of a given protein, and each point in this search space is characterized by the energy of the conformation at the point. Just as we imagined a ball rolling down a hill to find lower energy, we can now imagine exploring the search space of all conformations of a polypeptide to find the conformation having lowest energy. The general problem of exploring a search space to find a point minimizing some function is illustrated in the figure below, in which the height of each point represents the function value, and our goal is to find the lowest point in the space.Optimization problems can be thought of as exploring a search space, visualized as a landscape, in which the height of a point is the value of the function that we wish to optimize. Finding the highest or lowest point in this landscape corresponds to maximizing or minimizing the function over the search space. Image courtesy: David Beamish.A local search algorithm for ab initio structure predictionNow that we have conceptualized finding the most stable protein structure as exploring a search space, we turn to developing an algorithm to explore this space. Our idea is to use an approach similar to E. coli’s clever exploration algorithm from a previous module: over a sequence of steps, we will consult a collection of nearby points in the space, and then move in the “direction” in which the energy function decreases by the most. This approach belongs to a broad category of optimization algorithms called local search algorithms.Adapting this exploration algorithm to protein structure prediction requires us to develop a notion of what it means to consider the points “nearby” a given conformation in a protein search space. Many ab initio algorithms will start at an arbitrary initial conformation and then make a variety of minor modifications to that structure (i.e., nearby points in the space), updating the current conformation to the modification that produces the greatest decrease in free energy. These algorithms then iterate the process of changing the protein structure to have greatest decrease in potential energy. They terminate the search after reaching a structure for which no changes to the structure reduce the free energy.Yet returning to the chemotaxis analogy, imagine what happens if we were to place many small sugar cubes and one large sugar cube into the bacterium’s environment. The bacterium will sense the gradient not of the large sugar cube but of its nearest attractant. Because the smaller food sources outnumber the larger food source, the bacterium will likely not reach the point of greatest attractant concentration. In bacterial exploration, this is a feature, not a bug; if the bacterium exhausts one food source, then it will just move to another. But in protein structure prediction, we should be wary of returning a protein structure that does not have minimum free energy but does have the property that no “nearby” structures have lower energy.In general, an object in a search space that has a smaller value of the optimization function than neighboring points is called a local minimum. Returning to our landscape analogy, our search space may have many valleys, but we would like the one that is as low as possible over the entire space, called a global minimum.STOP: Do you see any ways in which we could improve our local search approach for structure prediction to avoid winding up in a local minimum?Fortunately, researchers applying local search algorithms have devised a number of ways to avoid local minima, and two are so fundamental as to be worth mentioning here. First, because the initial conformation has a huge influence on the final conformation, we could run the algorithm multiple times with different starting conformations. This is analogous to allowing multiple bacteria to explore their environment at different starting points. Second, every time we reach a local minimum, we could allow ourselves to change the structure with some probability, thus giving our local search algorithm the chance to “bounce” out of the local minimum. Once again, randomized algorithms help us solve problems!Applying an ab initio algorithm to a protein sequenceTo run an ab initio structure prediction algorithm on a real protein, we will use a software resource called QUARK, which is built upon the ideas discussed in the previous section, with some added features. For example, its algorithm applies a combination of multiple scoring functions to look for the lowest energy conformation across all of these functions.Levinthal’s paradox means that the search space of all possible structures for a protein is so large that accurately predicting large protein structures with ab initio modeling remains very difficult. As such, QUARK limits us to proteins with at most 200 amino acids, and so we will run it only on human hemoglobin subunit alpha.Visit tutorialToward a faster approach for protein structure predictionThe figure below shows the top five predicted human hemoglobin subunit alpha structures returned by QUARK as well as the protein’s experimentally verified structure, and an average of these six structures. It takes a keen eye to see any differences between these structures. We conclude that although ab initio prediction is slow, it is nevertheless accurate.The experimentally verified protein structure of human hemoglobin subunit alpha (top left) along with five models of this protein produced by QUARK from the protein’s primary sequence. We can see how close all five models are to the experimentally verified structure, as shown in the superimposition of all six structures at right.Yet we also wonder if we can speed up our structure prediction algorithms so that they will scale to a larger protein like the SARS-CoV-2 spike protein. In the next lesson, we will learn about another type of protein structure prediction that uses a database of known structures.STOP: What known protein structure(s) would you first want to consult when studying the SARS-CoV-2 spike protein?Next lesson"
     
   } ,
  
   {
     
        "title"    : "Protein Structure Comparison",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/accuracy",
        "date"     : "",
        "content"  : "In this lesson, we will compare the results of the SARS-CoV-2 spike protein prediction from the previous lesson against each other and against the protein’s empirically validated structure. To do so, we need a method of comparing two structures.Comparing two shapes with the Kabsch algorithmComparing two protein structures is intrinsically similar to comparing two shapes, such as those shown in the figure below.STOP: Consider the two shapes in the figure below. How similar are they?If you think you have a good handle on comparing the above two shapes, then it is because humans have very highly evolved eyes and brains. As we will see in the next module, training a computer to detect and differentiate objects is more difficult than you think!We would like to develop a distance function d(S, T) quantifying how different two shapes S and T are. If S and T are the same, then d(S, T) should be equal to zero; the more different S and T become, the larger d should become.You may have noticed that the two shapes in the preceding figure are, in fact, identical. To demonstrate that this is true, we can first move the red shape to superimpose it over the blue shape, then flip the red shape, and finally rotate it so that its boundary coincides with the blue shape, as shown in the animation below. In general, if a shape S can be translated, flipped, and/or rotated to produce shape T, then S and T are the same shape, and so d(S, T) should be equal to zero. The question is what d(S, T) should be if S and T are not the same shape.We can transform the red shape into the blue shape by translating it, flipping it, and then rotating it.Our idea for defining d(S, T), then, is first to translate, flip, and rotate S so that it resembles T “as much as possible” to give us a fair comparison. Once we have done so, we will devise a metric to quantify the difference between the two shapes that will represent d(S, T).We first translate S to have the same center of mass (or center of mass) as T. The center of mass of S is found at the point (xS, yS) such that xS and yS are the respective averages of the x-coordinates and y-coordinates on the boundary of S.The center of mass of some shapes, like the arc in the preceding example, can be determined mathematically. But for irregular shapes, we will first sample n points from the boundary of S and then estimate xS and yS as the average of all the respective x- and y-coordinates from the sampled points.After finding the center of mass of the two shapes S and T that we wish to compare, we translate S so that it has the same center of mass as T. We then wish to find the rotation of S, possibly along with a flip as well, that makes the shape resemble T as much as possible.Imagine that we have found the desired rotation; we are now ready to define d(S, T) in the following way. We sample n points along the boundary of each shape, converting S and T into vectors s = (s1, …, sn) and t = (t1, …, tn), where si is the i-th point on the boundary of S. The root mean square deviation (RMSD) between the two shapes is the square root of the average squared distance between corresponding points in the vectors,\[\text{RMSD}(\mathbf{s}, \mathbf{t}) = \sqrt{\dfrac{1}{n} \cdot (d(s_1, t_1)^2 + d(s_2, t_2)^2 + \cdots + d(s_n, t_n)^2)} \,.\]In this formula, d(si, ti) is the distance between the points si and ti.Note: RMSD is a very commonly used approach across data science when measuring the differences between two vectors.For an example two-dimensional RMSD calculation, consider the figure below, which shows two shapes with four points sampled from each.Two shapes with four points sampled from each.The distances between corresponding points in this figure are equal to \(\sqrt{2}\), 1, 4, and \(\sqrt{2}\). As a result, we compute the RMSD as\[\begin{align*}\text{RMSD}(\mathbf{s}, \mathbf{t}) &amp; = \sqrt{\dfrac{1}{4} \cdot (\sqrt{2}^2 + 1^2 + 2^2 + \sqrt{2}^2)} \\&amp; = \sqrt{\dfrac{1}{4} \cdot 9}\\&amp; = \sqrt{\dfrac{9}{4}}\\&amp; = \dfrac{3}{2}\end{align*}\]STOP: Do you see any issues with using RMSD to compare two shapes?Even if we assume that two shapes have already been overlapped and rotated appropriately, we still should ensure that we sample enough points to give a good approximation of how different the shapes are. Consider a circle inscribed within a square, as shown in the figure below. If we happened to sample only the four points indicated, then we would sample the same points for each shape and conclude that the RMSD between these two shapes is equal to zero. Fortunately, this issue is easily resolved by making sure to sample enough points from the shape boundaries.A circle inscribed within a square. Sampling of the four points where the shapes intersect will give a flawed estimate of zero for the RMSD.However, we have still assumed that we already rotated (and possibly flipped) S to be as “similar” to T as possible. In practice, after superimposing S and T to have the same center of mass, we would like to find the flip and/or rotation of S that minimizes the RMSD between our vectorizations of S and T over all possible ways of flipping and rotating S. It is this minimum RMSD that we define as d(S, T).The best way of rotating and flipping S so as to minimize the RMSD between the resulting vectors s and t can be found with a method called the Kabsch algorithm. Explaining this algorithm requires some advanced linear algebra and is beyond the scope of our work but is described here.PDB format represents a protein’s structureThe Kabsch algorithm offers a compelling way to determine the similarity of two protein structures. We can convert a protein containing n amino acids into a vector of length n by selecting a single representative point from each amino acid. We typically use the alpha carbon, the amino acid’s centrally located carbon atom.Whether a protein structure is experimentally validated or predicted by an algorithm, the structure is often represented in a unified file format used by the PDB called .pdb format. In this format (see the figure below), each atom in the protein is labeled according to several different characteristics, including:  the element from which the atom derives;  the amino acid in which the atom is contained;  the chain on which this amino acid is found;  the position of the amino acid within this chain; and  the 3D coordinates (x, y, z) of the atom in angstroms (10-10 meters).Lines 2,159 to 2,175 of the .pdb file for the experimentally verified SARS-CoV-2 spike protein structure, PDB entry 6vxx. These 17 lines contain information on the atoms taken from two amino acids, alanine and tyrosine. The rows corresponding to these amino acids’ alpha carbons are shown in green and appear as “CA” under the “Element” column. We have labeled the columns to make it clear what each column corresponds to: “Index” refers to the number of the amino acid; “Element” identifies the chemical element to which this atom corresponds; “Chain” indicates which chain the atom is found on; “Position” identifies the position in the protein of the amino acid from which the atom is taken; “Coordinates” indicates the x, y, and z coordinates of the atom’s location (in angstroms).Note: The above figure shows just part of the information needed to fully represent a protein structure. For example, a .pdb file will also contain information about the disulfide bonds between amino acids. For more information, consult the official PDB documentation).The Kabsch algorithm can be fooledAlthough the Kabsch algorithm is powerful, we should be careful when using it. Consider the figure below, which shows two toy protein structures; the orange structure (S) is identical to the blue structure (T) except for a change in a single bond angle between the third and fourth amino acids. And yet this tiny change in the protein’s structure causes a significant increase in d(si, ti) for every i greater than 3, which inflates the RMSD.(Top) Two hypothetical protein structures that differ in only a single bond angle between the third and fourth amino acids, shown in red. Each circle represents an alpha carbon. (Bottom left) Superimposing the first three amino acids shows how much the change in the bond angle throws off the computation of RMSD by increasing the distances between corresponding alpha carbons. (Bottom right) The Kabsch algorithm would align the centers of gravity of the two structures in order to minimize RMSD between corresponding alpha carbons. This alignment belies the similarity in the structures and makes it difficult for the untrained observer to notice that the two proteins only differ in a single bond angle.Another way in which the Kabsch algorithm can be tricked is in the case of an appended substructure that throws off the ordering of the amino acids. The following figure shows a toy example of a structure into which we incorporate a loop, thus throwing off the natural order of comparing amino acids. (The same effect is caused if one or more amino acids are deleted from one of the two proteins.)Two toy two protein structures, one of which includes a loop of three amino acids. After the loop, each amino acid in the orange structure will be compared against an amino acid that occurs farther long in the blue structure, thus increasing d(si, ti)2 for each such amino acid.To address this second issue, biologists will often align the sequences of two proteins first, discarding any positions that do not align well when it comes time to perform the RMSD calculation. We will soon see an example of a protein sequence alignment when comparing the coronavirus spike proteins.In short, if the RMSD of two proteins is large, then we should be wary of concluding that the proteins are very different, and we may need to combine RMSD with other methods of structure comparison. But if the RMSD is small (e.g., just a few angstroms), then we can have confidence that the proteins are indeed similar.We are now ready to apply the Kabsch algorithm to compare the structures that we predicted for human hemoglobin subunit alpha and the SARS-CoV-2 spike protein against their experimentally validated structures.Visit tutorialAssessing the accuracy of our structure prediction modelsIn the tutorials occurring earlier in this module, we used publicly available protein structure prediction servers to predict the structure of human hemoglobin subunit alpha (using ab initio modeling) and the SARS-CoV-2 spike protein (using homology modeling). We will now see how well our models performed by showing the values of RMSD produced by the Kabsch algorithm when comparing each of these models against the validated structures.Ab initio (QUARK) models of Human Hemoglobin Subunit AlphaThe table below shows the RMSD between each of the five predicted structures returned by QUARK and the validated structure of human hemoglobin subunit alpha (PDB entry: 1si4). We are tempted to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is such a short protein (141 amino acids), researchers would consider this RMSD score high.            Quark Model      RMSD                  QUARK1      1.58              QUARK2      2.0988              QUARK3      3.11              QUARK4      1.9343              QUARK5      2.6495      It is tempting to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is such a short protein (141 amino acids), researchers would consider this RMSD score high.Homology models of SARS-CoV-2 S proteinIn the homology tutorial, we used SWISS-MODEL and Robetta to predict the structure of the SARS-CoV-2 spike protein, and we used GalaxyWeb to predict the structure of this protein’s receptor binding domain (RBD). In addition to our predicted models, we will also assess five predicted models of the full SARS-CoV-2 spike protein released early in the COVID-19 pandemic by Rosetta@Home and published to the Seattle Structural Genomics Center for Infectious Disease (SSGCID).Because the work needed to generate these models was distributed over many users’ machines, comparing the RMSD scores obtained by the Rosetta@Home models against our own may provide insights on the effect of computational power on the accuracy of predictions. The SSGCID models can be found here.GalaxyWEBFirst, we consider the five GalaxyWEB models produced for the spike protein RBD. The following table shows the RMSD between each of these models and the validated SARS-CoV-2 RBD (PDB entry: 6lzg).            GalaxyWEB      RMSD                  Galaxy1      0.1775              Galaxy2      0.1459              Galaxy3      0.1526              Galaxy4      0.1434              Galaxy5      0.1202      All of these models have an excellent RMSD score and can be considered very accurate. Note that their RMSD is more than an order of magnitude lower than the RMSD computed for our ab initio model of hemoglobin subunit alpha, despite the fact that the RBD is longer (229 amino acids).SWISS-MODELWe now shift to homology models of the entire spike protein and start with SWISS-MODEL. The following table shows the RMSD between each of three structures produced by SWISS-MODEL and the validated structure of the SARS-CoV-2 spike protein (PDB entry: 6vxx).            SWISS MODEL      RMSD                  SWISS1      5.8518              SWISS2      11.3432              SWISS3      11.3432      The first structure has a lowest RMSD over the three models, and even though its RMSD (5.818) is significantly higher than what we saw for the GalaxyWEB prediction for the RBD, keep in mind that the spike protein is 1281 amino acids long, and so the sensitivity of RMSD to slight changes should give us confidence that our models are on the right track.RobettaFinally, we produced five predicted structures of a single chain of the SARS-CoV-2 spike protein using Robetta. The following table compares each of them against the validated structure of the SARS-CoV-2 spike protein (PDB: 6vxx).            Robetta      RMSD                  Robetta1      3.1189              Robetta2      3.7568              Robetta3      2.9972              Robetta4      2.5852              Robetta5      12.0975      STOP: Which do you think performed more accurately on our predictions: SWISS-MODEL or Robetta?SSGCIDAs explained above, the SSGCID models of the S protein released by Rosetta@Home used large amounts of computational resources. Therefore, we might expect to see RMSD scores lower than those of our models. Like before, we will compare the models to the validated structure of (PDB: 6vxx). This time, we will assess the accuracy of predictions of a single chain as well as of the entire spike protein.            SSGCID      RMSD (Full Protein)      RMSD (Single Chain)                  SSGCID1      3.505      2.7843              SSGCID2      2.3274      2.107              SSGCID3      2.12      1.866              SSGCID4      2.0854      2.047              SSGCID5      4.9636      4.6443      As we might expect due to their access to the resources of thousands of users’ computers, the SSGCID models outperform our SWISS-MODEL models. Yet a typical threshold for whether a predicted structure is accurate is if its RMSD compared to a validated structure is smaller than 2.0 angstroms, and the SSGCID models are not quite this accurate, even with access to hundreds of contributors’ machines.The inability of the SSGCID models to attain a very accurate predicted structure may make it seem that protein structure prediction is a lost cause. Perhaps biochemists should head back to expensive experimental validations and ignore the musings of computational scientists. In the conclusion to part 1, we will find hope.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Methylation Helps a Bacterium Adapt to Differing Concentrations",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/adaptation",
        "date"     : "",
        "content"  : "Bacterial tumbling frequencies remain constant for different attractant concentrationsIn the previous lesson, we explored the signal transduction pathway by which E. coli can change its tumbling frequency in response to a change in the concentration of an attractant. But the reality of cellular environments is that the concentration of an attractant can vary across several orders of magnitude. The cell therefore needs to detect not absolute concentrations of an attractant but rather relative changes.E. coli detects relative changes in its concentration via adaptation to these changes. If the concentration of attractant remains constant for a period of time, then regardless of the absolute value of the concentration, the cell returns to the same background tumbling frequency. In other words, E. coli demonstrates robustness to the attractant concentration in maintaining its default tumbling behavior.However, our current model is not able to address this adaptation. If the ligand concentration increases in the model, then phosphorylated CheY will plummet and remain at a low steady state.In this lesson, we will investigate the biochemical mechanism that E. coli uses to achieve such a robust response to environments with different background concentrations. We will then expand the model we built in the previous lesson to replicate the bacterium’s adaptive response.Bacteria remember past concentrations using methylationRecall that in the absence of an attractant, CheW and CheA readily bind to an MCP, leading to greater autophosphorylation of CheA, which in turn phosphorylates CheY. The greater the concentration of phosphorylated CheY, the more frequently the bacterium tumbles.Signal transduction is achieved through phosphorylation, but E. coli maintains a “memory” of past environmental concentrations through a chemical process called methylation. In this  reaction, a methyl group (-CH3) is added to an organic molecule; the removal of a methyl group is called demethylation.Every MCP receptor contains four methylation sites, meaning that between zero and four methyl groups can be added to the receptor. On the plasma membrane, many MCPs, CheW, and CheA molecules form an array structure. Methylation reduces the negative charge on the receptors, stabilizing the array and facilitating CheA autophosphorylation. The more sites that are methylated, the higher the autophosphorylation rate of CheA, which means that CheY has a higher phosphorylation rate, and tumbling frequency increases.Note that we now have two different ways that tumbling frequency can be elevated. First, if the concentration of an attractant is low, then CheW and CheA freely form a complex with the MCP, and the phosphorylation cascade passes phosphoryl groups to CheY, which interacts with the flagella and keeps tumbling frequency high. Second, an increase in MCP methylation can also boost CheA autophosphorylation and lead to an increased tumbling frequency.Methylation of MCPs is achieved by an additional protein called CheR. When bound to MCPs, CheR methylates ligand-bound MCPs faster12, and so the rate of MCP methylation by CheR is higher if the MCP is bound to a ligand.3. Let’s consider how this fact affects a bacterium’s behavior.Say that E. coli encounters an increase in attractant concentration. Then the lack of a phosphorylation cascade will mean less phosphorylated CheY, and so the tumbling frequency will decrease. However, if the attractant concentration levels off, then the tumbling frequency will flatten, while CheR starts methylating the MCP. Over time, the rising methylation will increase CheA autophosphorylation, bringing back the phosphorylation cascade and raising tumbling frequency back to default levels.Just as the phosphorylation of CheY can be reversed, MCP methylation can be undone to prevent methylation from being permanent. In particular, an enzyme called CheB, which like CheY is phosphorylated by CheA, demethylates MCPs (as well as autodephosphorylates). The rate of an MCP’s demethylation is dependent on the extent to which the MCP is methylated. In other words, the rate of MCP methylation is higher when the MCP is in a low methylation state, and the rate of demethylation is faster when the MCP is in a high methylation state.3The figure below adds CheR and CheB to provide a complete picture of the core pathways influencing chemotaxis. To model these pathways and see how our simulated bacterial system responds to different relative attractant concentrations, we will need to add quite a few molecules and reactions to our current model.The chemotaxis signal-transduction pathway with methylation included. CheA phosphorylates CheB, which methylates MCPs, while CheR demethylates MCPs. Blue lines denote phosphorylation, grey lines denote dephosphorylation, green arrows denote methylation, and red arrows denote demethlyation. Image modified from Parkinson Lab’s illustrations.Combinatorial explosion and the need for rule-based modelingWe would like to add the methylation reactions discussed above to the model that we built in the previous lesson and see if this model can replicate the adaptation behavior of E. coli in the presence of a changing attractant concentration. Before incorporating the adaptation mechanisms in our BioNetGen model, we will first describe the reactions that BioNetGen will need.We begin with considering the MCP complexes. In the phosphorylation tutorial, we identified two components relevant for reactions involving MCPs: a ligand-binding component l and a phosphorylation component Phos. The adaptation mechanism introduces two additional reactions: methylation of the MCP by CheR, and demethylation of the MCP by CheB.We also need to include binding and dissociation reactions between the MCP and CheR because under normal conditions, most CheR are bound to MCP complexes.4 We will therefore introduce two additional components to the MCP molecules in addition to their phosphorylation components: r (denoting CheR-binding) and Meth (denoting methylation states). In our simulation, we will use three methylation levels (low, medium, and high) rather than five because these three states are most involved in the chemotaxis response to attractants.5Imagine for a moment that we were attempting to specify every reaction that could take place in our model. To specify an MCP, we would need to tell the program whether it is bound to a ligand (two possible states), whether it is bound to CheR (two possible states), whether it is phosphorylated (two possible states), and which methylation state it is in (three possible states). Therefore, a given MCP has 2 · 2 · 2 · 3 = 24 total states.Say that we are simulating the simple reaction of a ligand binding to an MCP, which we originally wrote as T + L  TL. We now need this reaction to include 12 of the 24 states, the ones corresponding to the MCP being unbound to the ligand. Our previously simple reaction would become 12 different reactions, one for each possible unbound state of the complex molecule T. And if the situation were just a little more complex, with the ligand molecule L having n possible states, then we would have 12n reactions. Imagine trying to debug a model in which we had accidentally incorporated a typo when transcribing just one of these reactions!In other words, as our model grows, with multiple different states for each molecule involved in each reaction, the number of reactions we need to represent the system grows very fast; this phenomenon is called combinatorial explosion and means that building realistic models of biochemical systems at scale can be daunting.A major benefit of using a rule-based modeling language such as the one developed by BioNetGen is that it circumvents combinatorial explosion by consolidating many reactions into a single rule. For example, when modeling ligand-MCP binding, we can summarize the 12 different reactions with the rule “a free ligand molecule binds to an MCP that is not bound to a ligand molecule.” In the BioNetGen language, this rule is represented by the same one-line expression as it was in the previous lesson:LigandReceptor: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disWe will not bog down the text with a full specification of all the rules needed to add methylation to our model while avoiding combinatorial explosion. If you’re interested in the details, please follow our tutorial.Visit tutorialBacterial tumbling is robust to large sudden changes in attractant concentrationIn the figures that follow, we plot the concentration over time of each molecule for different values of l0, the initial concentration of ligand. From what we have learned about E. coli, we should see the concentration of phosphorylated CheY (and therefore the bacterium’s tumbling frequency) drop before returning to its original equilibrium. But will our simulation capture this behavior?First, we add a relatively small amount of attractant, setting l0 equal to 10,000. The system returns so quickly to an equilibrium in phosphorylated CheY that it is difficult to imagine that the attractant has had any effect on tumbling frequency.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 10,000 initial attractant ligand particles.If instead l0 is equal to 100,000, then we obtain the figure below. After an initial drop in the concentration of phosphorylated CheY, it returns to equilibrium after a few minutes.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100,000 initial attractant ligand particles.When we increase l0 by another factor of ten to 1 million, the initial drop is more pronounced, but the system returns just as quickly to equilibrium. Note how much higher the concentration of methylated receptors are in this figure compared to the previous figure; however, there are still a significant concentration of receptors with low methylation, indicating that the system may be able to handle an even larger jolt of attractant.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with one million initial attractant ligand particles.When we set l0 equal to 10 million, we give the system this bigger jolt. Once again, the model returns to its previous CheY equilibrium after a few minutes.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with ten million initial attractant ligand particles.Finally, with l0 equal to 100 million, we see what we might expect: the steepest drop in phosphorylated CheY yet, but a system that is able to return to equilibrium.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100 million initial attractant ligand particles.Our model, which is built on real reaction rate parameters, provides compelling evidence that the E. coli chemotaxis system is robust to changes in its environment across several orders of magnitude of attractant concentration. This robustness has been observed in real bacteria67, as well as replicated by other computational simulations8.Aren’t bacteria magnificent?However, our work is not done. We have simulated E. coli adapting to a single sudden change in its environment, but life is about responding to continual change. So in the next lesson, we will further examine how our simulated bacterium responds in an environment in which the ligand concentration is changing constantly.Next lessonAdditional resourcesIf you find chemotaxis biology as interesting as we do, then we suggest the following resources.  An amazing introduction to chemotaxis from the  Parkinson Lab.  A good overview of chemotaxis: Webre et al. 2003   A review article on chemotaxis pathway and MCPs:  Baker et al. 2005.  A more recent review article of MCPs: Parkinson et al. 2015.  Lecture notes on robustness and integral feedback: Berg 2008.            Amin DN, Hazelbauer GL. 2010. Chemoreceptors in signaling complexes: shifted conformation and asymmetric coupling. Available online &#8617;              Terwilliger TC, Wang JY, Koshland DE. 1986. Kinetics of Receptor Modification - the multiply methlated aspartate receptors involved in bacterial chemotaxis. The Journal of Biolobical Chemistry. Available online &#8617;              Spiro PA, Parkinson JS, and Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Biochemistry 94:7263-7268. Available online. &#8617; &#8617;2              Lupas A., and Stock J. 1989. Phosphorylation of an N-terminal regulatory domain activates the CheB methylesterase in bacterial chemotaxis. J Bio Chem 264(29):17337-42. Available online &#8617;              Boyd A., and Simon MI. 1980. Multiple electrophoretic forms of methyl-accepting chemotaxis proteins generated by stimulus-elicited methylation in Escherichia coli. Journal of Bacteriology 143(2):809-815. Available online &#8617;              Shimizu TS, Delalez N, Pichler K, and Berg HC. 2005. Monitoring bacterial chemotaxis by using bioluminescence resonance energy transfer: absence of feedback from the flagellar motors. PNAS. Available online &#8617;              Krembel A., Colin R., Sourijik V. 2015. Importance of multiple methylation sites in Escherichia coli chemotaxis. Available online &#8617;              Bray D, Bourret RB, Simon MI. 1993. Computer simulation of phosphorylation cascade controlling bacterial chemotaxis. Molecular Biology of the Cell. Available online &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Gene Autoregulation is Surprisingly Frequent",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/autoregulation",
        "date"     : "",
        "content"  : "Using randomness to determine statistical significanceIn the previous lesson, we introduced the transcription factor network, in which a protein X is connected to a protein Y if X is a transcription factor that regulates the production of Y. We also saw that in the E. coli transcription factor network, there seemed to be a large number of loops, or edges connecting some transcription factor X to itself, and which indicate the autoregulation of X.In the introduction, we briefly referenced the notion of a network motif, a structure occurring often throughout a network. Our assertion is that the loop is a motif in the transcription factor network; how can we defend this claim?To argue that a loop is indeed a motif in the E. coli transcription factor network, we will apply a paradigm that occurs throughout computational biology (and science in general) when determining whether an observation is statistically significant. We will compare our observation against a  randomly generated dataset. Without getting into the statistical details, if an observation is frequent in a real dataset, and rare in a random dataset, then it is likely to be statistically significant. Randomness saves the day again!STOP: How can we apply this paradigm of a randomly generated dataset to determine whether a transcription factor network contains a significant number of loops?Comparing a real transcription factor network against a random networkTo determine whether the number of loops in the transcription factor network of E. coli is statistically significant, we will compare this number against the expected number of loops we would find in a randomly generated transcription factor network. If the former is much higher than the latter, then we have strong evidence that some selective force is causing gene autoregulation.There are multiple ways to generate a random network, but we will use an approach developed by Edgar Gilbert in 19591. Given an integer n and a probability p (between 0 and 1), we form n nodes. Then, for every possible pair of nodes X and Y, we connect X to Y via a directed edge with probability p; that is, we simulate the process of flipping a weighted coin that has probability p of coming up “heads”.Note: Simulating a weighted coin flip amounts to generating a “random” number x between 0 and 1, and considering it “heads” if x is less than p and “tails” otherwise. For more details on random number generation, consult Programming for Lovers).STOP: What should n and p be if we are generating a random network to compare against the E. coli transcription factor network?The full E. coli transcription factor network contains thousands of genes, most of which are not transcription factors. As a result, the approach described above may form a random network that connects non-transcription factors to other nodes, which we should avoid.Instead, we will focus on the network comprising only those E. coli transcription factors that regulate each other. This network has 197 nodes and 477 edges, and so we will begin by forming a random network with n = 197 nodes.We then select p to ensure that our random network will on average have 477 edges. To do so, we note that there are n2 pairs of nodes that could have an edge connecting them (n choices for the starting node and n for the ending node). If we were to set p equal to 1/n2, then we would expect on average only to see a single edge in the random network. We therefore scale this value by 477 and set p equal to 477/n2  0.0123 so that we will see, on average, 477 edges in our random network.In the following tutorial, we write some code to count the number of loops in the real E. coli transcription factor network. We then build a random network and compare the number of loops found in this network against the number of loops in the real network.Visit tutorialThe negative autoregulation motifIn a random network containing n nodes, the probability that a given edge is a loop is 1/n. Therefore, if the network has e edges, then we would on average see e/n loops in the network. In our case, n is 197, and e is 477; therefore, on average, we expect to see 197/497  2.42 loops in a random network. Yet the real network of E. coli transcription factors that regulate each other contains 130 loops!Furthermore, in a random network, we would expect about half of the edges correspond to activation, and the other half correspond to repression. But if you followed the preceding tutorial, then you know that of the 130 loops in the E. coli network, 35 correspond to activation and 95 correspond to repression. Just as you would be surprised to flip a coin 130 times and see “heads” come up 95 times, the cell must be negatively autoregulating for some reason.Not only is autoregulation an important feature of transcription factors, but these transcription factors tend to negatively autoregulate. Why in the world would organisms have evolved the process of autoregulation only for a transcription factor to slow down its own transcription? In the next lesson, we will begin to unravel the mystery.Next lesson            Gilbert, E.N. (1959). “Random Graphs”. Annals of Mathematical Statistics. 30 (4): 1141–1144. doi:10.1214/aoms/1177706098. &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Protein Biochemistry",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/biochemistry",
        "date"     : "",
        "content"  : "The four levels of protein structureProtein structure is a broad term that encapsulates four different levels of description. A protein’s primary structure refers to the amino acid sequence of the polypeptide chain. The primary structure of human hemoglobin subunit alpha can be downloaded here, and the primary structure of the SARS-CoV-2 spike protein, which we showed earlier, can be downloaded here.A protein’s secondary structure describes its highly regular, repeating intermediate substructures that form before the overall protein folding process completes. The two most common such substructures, shown in the figure below, are alpha helices and beta sheets. Alpha helices occur when nearby amino acids wrap around to form a tube structure; beta sheets occur when nearby amino acids line up side-by-side.The general shape of alpha helices (left) and beta sheets (right), the two most common protein secondary structures. Source: Cornell, B. (n.d.). https://ib.bioninja.com.au/higher-level/topic-7-nucleic-acids/73-translation/protein-structure.htmlA protein’s tertiary structure describes its final 3D shape after the polypeptide chain has folded and is stable. Throughout this module, when discussing the “shape” or “structure” of a protein, we are almost exclusively referring to its tertiary structure. The figure below shows the tertiary structure of human hemoglobin subunit alpha. For the sake of simplicity, this figure does not show the position of every atom in the protein but rather represents the protein shape as a composition of secondary structures.The tertiary structure of human hemoglobin subunit alpha. Within the structure are multiple alpha helix secondary structures. Source: https://www.rcsb.org/structure/1SI4.Finally, some proteins have a quaternary structure, which describes the protein’s interaction with other copies of itself to form a single functional unit, or a multimer. Many proteins do not have a quaternary structure and function as an independent monomer. The figure below shows the quaternary structure of hemoglobin, which is a multimer consisting of two alpha subunits and two beta subunits.The quaternary structure of human hemoglobin, which consists of two alpha subunits (shown in red) and two beta subunits (shown in blue). Source: https://commons.wikimedia.org/wiki/File:1GZX_Haemoglobin.png]().As for SARS-CoV and SARS-CoV-2, the spike protein is a homotrimer, meaning that it is formed of three essentially identical units called chains, each one translated from the corresponding region of the coronavirus’s genome; these chains are colored differently in the figure below. In this module, when discussing the structure of the spike protein, we typically are referring to the structure of a single chain.A side and top view of the quaternary structure of the SARS-CoV-2 spike protein homotrimer, with its three chains highlighted using different colors.The structural units making up proteins are often hierarchical, and the spike protein is no exception. Each spike protein chain is a dimer, consisting of two subunits called S1 and S2. Each of these subunits further divides into protein domains, distinct structural units within the protein that fold independently and are typically responsible for a specific interaction or function. For example, the SARS-CoV-2 spike protein has a receptor binding domain (RBD) located on the S1 subunit of the spike protein that is responsible for interacting with the human ACE2 enzyme; the rest of the protein does not come into contact with ACE2. We will say more about the RBD soon.Proteins seek the lowest energy conformationNow that we know a bit more about how protein structure is defined, we will discuss why proteins fold in the same way every time. In other words, what are the factors driving the magic algorithm?Amino acids’ side chain variety causes amino acids to have different chemical properties, which can lead to different conformations being more chemically “preferable” than others. For example, the table below groups the twenty amino acids commonly occurring in proteins according to chemical properties. Nine of these amino acids are hydrophobic (also called nonpolar), meaning that their side chains tend to be repelled by water, and so we tend to find these amino acids sheltered from the environment on the interior of the protein.A chart of the twenty amino acid grouped by chemical properties. The side chain of each amino acid is highlighted in blue. Image courtesy: OpenStax Biology.We can therefore view protein folding as finding the tertiary structure that is the most stable given a polypeptide’s primary structure. A central theme of the previous module on bacterial chemotaxis was that a system of chemical reactions moves toward equilibrium. The same principle is true of protein folding; when a protein folds into its final structure, it is obtaining a conformation that is as chemically stable as possible.To be more precise, the potential energy (sometimes called free energy) of a molecule is the energy stored within an object due to its position, state, and arrangement. In molecular mechanics, the potential energy is made up of the sum of bonded energy and non-bonded energy.As the protein bends and twists into a stable conformation, bonded energy derives from the protein’s covalent bonds, as well as the bond angles between adjacent amino acids, and the torsion angles that we introduced in the previous lesson.Non-bonded energy comprises electrostatic interactions and van der Waals interactions. Electrostatic interactions refer to the attraction and repulsion force from the electric charge between pairs of charged amino acids. Two of the twenty standard amino acids (arginine and lysine) are positively charged, and two (aspartic acid and glutamic acid) are negatively charged. Two nearby amino acids of opposite charge may interact to form a salt bridge. Conformations that contain salt bridges and keep apart similarly charged amino acids will therefore have lower free energy contributed by electrostatic interactions.As for van der Waals interactions, atoms are dynamic systems, with electrons constantly buzzing around the nucleus, as shown in the figure below.A carbon-12 atom showing six positively charged protons (green), six neutrally charged neutrons (blue), and six negatively charged electrons (red). Under typical circumstances, the electrons will most likely be distributed uniformly around the nucleus.However, due to random chance, the electrons in an atom could momentarily be unevenly distributed on one side of the nucleus. Because electrons are negatively charged, this uneven distribution will cause the atom to have a temporary negative charge on the side with the excess electrons and a temporary positive charge on the opposite side. As a result of this charge, one side of the atom may attract only the oppositely charged components of another atom, creating an induced dipole in that atom in turn as shown in the figure below. Van der Waals forces refer to the attraction and repulsion between atoms because of induced dipoles.Due to random chance, the electrons in the atom on the left have clustered on the left side of the atom, creating a net negative charge on this side of the atom, and therefore a net positive charge on the right side of the atom. This polarity induces a dipole in the atom on the right, whose electrons are attracted because of van der Waals forces.As the protein folds, it seeks a conformation of lowest total potential energy based on the combination of all these forces. For an analogy, imagine a ball on a slope, as shown in the following figure. The ball will tend to move down the slope unless it is pushed uphill by some outside force, making it unlikely that the ball will wind up at the top of a hill. We will keep this analogy in mind as we return to the problem of protein structure prediction.A ball on a hill offers a classic analogy for a protein folding into the lowest energy structure. As the ball is more likely to move down into a valley, a protein is more likely to fold into a more stable, low-energy conformation.Next lesson"
     
   } ,
  
   {
     
        "title"    : "A Biochemically Accurate Model of Bacterial Chemotaxis",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/biochemistry",
        "date"     : "",
        "content"  : "Transducing an extracellular signal to a cell’s interiorWe now turn to the question of how the cell conveys the extracellular signal it has detected via the process of signal transduction to the cell’s interior and produces an action. When E. coli senses an increase in the concentration of glucose, meaning that more ligand-receptor binding is taking place at the receptor that recognizes glucose, how does the bacterium change its behavior?The engine of signal transduction is phosphorylation, a chemical reaction that attaches a phosphoryl group (PO3-) to an organic molecule.  Phosphoryl modifications serve as an information exchange of sorts because, as we will see, they activate or deactivate certain enzymes.A phosphoryl group usually comes from one of two sources. First, the phosphoryl can be broken off of an adenosine triphosphate (ATP) molecule, the “energy currency” of the cell, producing adenosine diphosphate (ADP). Second, the phosphoryl can be exchanged from a phosphorylated molecule that loses its phosphoryl group in a dephosphorylation reaction.For many cellular responses, including bacterial chemotaxis, a sequence of phosphorylation and dephosphorylation events called a phosphorylation cascade serves to transmit information within the cell about the amount of ligand binding being detected on the cell’s exterior. In this lesson, we discuss how this cascade of chemical reactions leads to a change in bacterial movement.A high-level view of the transduction pathway for chemotaxis is shown in the figure below. The cell membrane receptors that we have been working with are called methyl-accepting chemotaxis proteins (MCPs), and they bridge the cellular membrane, binding both to ligand stimuli in the cell exterior and to other proteins on the inside of the cell. The pathway includes a number of additional proteins, which all start with the prefix “Che” (short for “chemotaxis”).A summary of the chemotaxis transduction pathway. A ligand binding signal is propagated through CheA and CheY phosphorylation, which leads to a response of clockwise flagellar rotation. The blue curved arrow denotes phosphorylation, the grey curved arrow denotes dephosphorylation, and the blue dashed arrow denotes a chemical interaction. Our figure is a simplified view of Parkinson Lab illustrations.On the interior of the cellular membrane, MCPs form complexes with two proteins called CheW and CheA. In the absence of MCP-ligand binding, this complex is more stable, and the CheA molecule autophosphorylates, meaning that it adds a phosphoryl group taken from ATP to itself  a concept that might seem mystical if you have not already followed our discussion of autoregulation in the previous module.Phosphorylated CheA can pass on its phosphoryl group to a molecule called CheY, which interacts with the flagellum in the following way. Each flagellum has a protein complex called the flagellar motor switch that is responsible for controlling the direction of flagellar rotation. The interaction of this protein complex with phosphorylated CheY induces a change of flagellar rotation from counter-clockwise to clockwise. As we discussed earlier in the module, this change in flagellar rotation causes the bacterium to tumble, which in the absence of an increase in attractant occurs every 1 to 1.5 seconds.Yet when a ligand binds to the MCP, the MCP undergoes conformation changes, which reduce the stability of the complex with CheW and CheA. As a result, CheA is less readily able to autophosphorylate, which means that it does not phosphorylate CheY, which cannot change the flagellar rotation to clockwise, and so the bacterium is less likely to tumble.In short, attractant ligand binding causes more phosphorylated CheA and CheY, which means that it causes fewer flagellar interactions and therefore less tumbling, so that the bacterium will run for a longer period of time.Note: A critical part of this process is that if a ligand is detected, and the cell has a high concentration of CheY, then it needs to decrease the CheY concentration quickly. Otherwise, the cell will not be able to change its tumbling frequency. To this end, the cell is able to dephosphorylate CheY using an enzyme called CheZ.Adding phosphorylation events to our model of chemotaxisWe would like to use the Gillespie algorithm that we introduced in the previous lesson to simulate the reactions driving chemotaxis signal transduction and see what happens if the bacterium “senses an attractant”, meaning that the attractant ligand’s concentration increases and leads to more receptor-ligand binding.This model will be more complicated than any we have introduced thus far. We will need to account for both bound and unbound MCP molecules, as well as phosphorylated and unphosphorylated CheA and CheY enzymes. We will also need to model phosphorylation reactions of CheA, which depend on the current concentrations of bound and unbound MCP molecules. We will at least make the simplifying assumption that the MCP receptor is permanently bound to CheA and CheW, so that we do not need to represent these molecules individually. In other words, rather than thinking about CheA autophosphorylating, we will think about the receptor that includes CheA autophosphorylating.In the previous lesson, we very briefly introduced BioNetGen as a way to convert reactions into a software package applying the Gillespie algorithm. However, BioNetGen is useful not only for running particle-free simulations, but also because it implements its own language for rule-based modeling.Say that we wanted to specify all particles and the reactions involving them in the manner used up to this point in the book. We would need one particle type to represent MCP molecules, another particle type to represent ligands, and a third to represent bound complexes. A bound complex molecule binds with CheA and CheW and can be either phosphorylated or unphosphorylated, necessitating two different molecule types. In turn, CheY can be phosphorylated or unphosphorylated as well, requiring two more particles.Instead, the BioNetGen language will allow us to conceptualize this system much more concisely using rules that apply to particles that are in a variety of states. The BioNetGen representation of the four particles in our model is shown below. The notation Phos~U~P indicates that a given molecule type can be either phosphorylated or unphosphorylated, so that we do not need multiple different expressions to represent the molecule.L(t)             #ligand moleculeT(l,Phos~U~P)    #receptor complexCheY(Phos~U~P)CheZ()The conciseness of BioNetGen’s molecule representation helps us represent our reactions concisely as well. We first reproduce the reversible binding and dissociation reaction from the previous lesson.LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disNext, we represent the phosphorylation of the MCP complex. Recall that the phosphorylation of CheA can occur at different rates depending on whether the MCP is bound, and so we will need two different reactions to model these different rates. In our model, the phosphorylation of the MCP will occur at one fifth the rate when it is bound to the attractant ligand.FreeTP: T(l,Phos~U) -&gt; T(l,Phos~P) k_T_phosBoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*0.2Finally, we represent the phosphorylation and dephosphorylation of CheY. The former requires a phosphorylated MCP receptor, while the latter is done with the help of a CheZ molecule that can be in any state.YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phosYDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephosNow that we have converted the reactions from the chemotaxis signal transduction pathway into BioNetGen’s rule-based language, we would like to see what happens when we change the concentrations of the ligand. Ideally, the bacterium should be able to distinguish different ligand concentrations. That is, the higher the concentration of an attractant ligand, the lower the concentration of phosphorylated CheY, and the lower the tumbling frequency of the bacterium.But does higher attractant concentration in our model really lead to a lower concentration of CheY? Let’s find out by incorporating the phosphorylation pathway into our ligand-receptor model in the following tutorial.Visit tutorialChanging ligand concentrations leads to a change in internal molecular concentrationsThe following figure shows the concentrations of phosphorylated CheA and CheY in a system at equilibrium in the absence of ligand. As we might expect, these concentrations remain at steady state (with some healthy noise), and so the cell stays at its background tumbling frequency.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation in which no ligand is present.The sudden addition of 5,000 attractant ligand molecules increases the concentration of bound receptors, therefore leading to less CheA autophosphorylation, and less phosphorylated CheY.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 5,000 initial attractant ligand particles.If we instead add 100,000 attractant molecules, then we see an even more drastic decrease in phosphorylated CheA and CheY.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100,000 initial attractant ligand particles.This Gillespie model confirms the biological observations that an increase in attractant reduces the concentration of phosphorylated CheY. This reduction takes place remarkably quickly, with the cell attaining a new equilibrium in a fraction of a second.And yet you may remain skeptical of our model. After all, the biochemistry powering chemotaxis may be elegant, but it is also simple, and perhaps you are not surprised that the model’s particle concentrations reproduced the response of E. coli to an attractant ligand.But what we have shown in this lesson is just part of the story. In the next lesson, we will see that the biochemical realities of chemotaxis are even more complicated, and for good reason  this added complexity will allow E. coli, and our model of it, to react to a dynamic world with surprising sophistication.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Preorder the Book",
        "category" : "",
        "tags"     : "",
        "url"      : "/buy_the_book/",
        "date"     : "",
        "content"  : "We are producing a textbook companion to this course called Biological Modeling: A Short Tour, which will be available in spring 2022 in both print and electronic formats. The book will follow the core content of the course, while the tutorials  which may change frequently in the future  will be hosted on this website.Publication of the book was graciously funded via Kickstarter. You can still preorder the book at an exclusive introductory price and receive an E-book for free by finding us on Indiegogo.Note: Print copies of the book are only available to backers in North America at this time. If you are outside North America, you can still preorder an electronic copy of the book. If the campaign is successful, we will be providing the book to international markets; please let us know if you’d like to be notified when the book is published by joining our international mailing list at https://forms.gle/SHQXvjpbPPWKYHuU8."
     
   } ,
  
   {
     
        "title"    : "An Overview of Classification and k-Nearest Neighbors",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/classification",
        "date"     : "",
        "content"  : "The classification problemLabeling images of WBCs according to their family is a specific instance of an ubiqitous problem in data science, in which we wish to classify each object in a given dataset into one of k classes.In our ongoing example, the data are images of WBCs, and the classes are the three main families of WBCs (granulocytes, lymphocytes, and monocytes). To take a different example, our data could be tumor genomes sequenced from cancer patients, which we want to classify based on which therapeutic should be prescribed for the patient. Or the data may be the past sales behavior of shoppers, who we want to classify into two classes based on a prediction of whether they will buy a new product.The iris flower datasetA classical dataset commonly used for motivating classification is the iris flower dataset, which was compiled by Edgar Anderson12, and which Ronald Fisher used in a seminal paper on classification in 19363. Anderson took measurements from 150 iris flowers, equally divided among three species (see figure below).                                                                                                        Representative images of the three species of iris included in Anderson’s iris flower dataset. Image courtesies, from left to right: Emma Forsberg, unknown author, Robert H. Mohlenbrock.  Anderson measured four attributes, or features, of each of the flowers in his dataset: both the width and height of the flower’s petal, and both the width and height of the flower’s sepal (a green offshoot beneath the petals). The features and species labels for twelve of the flowers in the iris flower dataset are shown in the table below (click here for the full dataset). Fisher considered whether it was possible to classify each flower according to its species using only the features Anderson had measured.            Sepal length (cm)      Sepal width (cm)      Petal length (cm)      Petal width (cm)      Species                  5.1      3.5      1.4      0.2      I. setosa              4.9      3.0      1.4      0.2      I. setosa              4.7      3.2      1.3      0.2      I. setosa              4.6      3.1      1.5      0.2      I. setosa              7.0      3.2      4.7      1.4      I. versicolor              6.4      3.2      4.5      1.5      I. versicolor              6.9      3.1      4.9      1.5      I. versicolor              5.5      2.3      4.0      1.3      I. versicolor              6.3      3.3      6.0      2.5      I. virginica              5.8      2.7      5.1      1.9      I. virginica              7.1      3.0      5.9      2.1      I. virginica              6.3      2.9      5.6      1.8      I. virginica      A table containing values of the four features for twelve members of the iris flower dataset. The complete dataset was accessed from the University of California, Irvine Machine Learning Repository].STOP: What characteristics do flowers from each species tend to share in terms of the four features in the table above?From flowers to vectorsIf we were to use only two features in the Iris flower dataset, then a flower’s feature values x and y could be represented as a point in two-dimensional space (x, y). The figure below shows such a plot for the features of petal length (x-axis) and petal width (y-axis).Petal width (x-axis) plotted against width (y-axis) for each of the flowers in the iris flower dataset, colored by species. There are not fifty points corresponding to every species because some flowers have the same petal length and width.Note how stark the pattern in the above figure is. Even though we chose only two features from the iris flowers, the points associated with the flowers can be divided into three main clusters by species. In other words, nearby points tend to correspond to flowers from the same species.If we were to use all four features for the iris dataset, then every flower would be represented by a point in four-dimensional space. For example, the first flower in our initial table of iris features would be represented by the point (5.1, 3.5, 1.4, 0.2). In general, when classifying a collection of data with n features, each element in the dataset can be represented by a feature vector of length n, whose i-th value corresponds to its value for the i-th feature.Classifying unknown elements with k-nearest neighborsFor the iris flower dataset, recall our observation that data points were more likely to belong to the same class if they were nearby. Our hope is that this fact is true for other datasets, that elements from the same class will have feature vectors that are close in n-dimensional space. If so, then we can classify a data point whose class is unknown by determining which data points with known classification it is near.STOP: Consider the point with unknown class (gray) in the figure below. Should it be assigned to the class of the green points or to the class of the blue points?An unknown point (gray) along with a collection of nearby points belonging to two classes, colored green and blue.The preceding question indicates that classifying points can be surprisingly open-ended. Because of this freedom, researchers have devised a variety of different approaches for classifying data given data with known classes.We will discuss a simple but powerful classification algorithm called k-nearest neighbors, or k-NN4. In k-NN, we fix a positive integer k in advance, which will be used for classification of all points. Then, for each point with unknown class, we assign it the class possessed by the largest number of its k closest neighbors.In the ongoing example, if we were using k equal to 1, then we would assign the unknown point to the green class (see figure below).When using k-NN with k equal to 1, we classify an unknown point according to the point of known class that is nearest; for this reason, the unknown point above would be assigned to the green class.However, with the same data and k equal to 4, the figure below shows that a majority of the k nearest neighbors are blue, and so we classify the unknown point as blue. This example reinforces a theme of this course, that the results of an algorithm can be sensitive to our choice of parameters.When using k-NN with k equal to 4, k-NN will classify the unknown point as blue, since three of its four closest neighbors are blue.STOP: When k = 2 or k = 6 for the above figure, note that we obtain a tie in the number of points from each known class belonging to the k nearest neighbors of a point with unknown class. How could we break ties in k-NN?In the more general case in which feature vectors have length n, we can determine which points are nearest to a given point by using the Euclidean distance, which generalizes the distance between two points in two-dimensional space to vectors in n-dimensional space. Say that we have the vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn). Then the Euclidean distance between them is given by the sum of squares of differences between corresponding vector elements:\[d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n-y_n)^2}\]We now have learned how to use k-NN to classify feature vectors with unknown classes given vectors with known classes. There is just one problem: how can we convert an image of a WBC into a vector?Next lesson            Anderson E (1935) The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5. &#8617;              Anderson E (1936) The species problem in Iris. Annals of the Missouri Botanical Garden 23(3):457-509. Available online &#8617;              Fisher RA (1936) The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7(2):179-188. Available online &#8617;              Fix E. and Hodges J.L. (1951) Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field. Available online &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Conclusion: The Robustness of Biological Oscillators",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/conclusion",
        "date"     : "",
        "content"  : "Biological oscillators must be robustIf your heart skips a beat when you are watching a horror movie, then it should return quickly to its natural rhythm. When you hold your breath to dive underwater, your normal breathing resumes at the surface. And regardless of what functions your cells perform or what disturbances they find in their environment, they should be able to maintain a normal cell cycle.An excellent illustration of oscillator robustness is the body’s ability to handle jet lag. There is no apparent reason why humans would have evolved to be able to fly halfway around the world. And yet our circadian clock is so resilient that after a few days of fatigue and crankiness, our circadian clock returns to a normal daily cycle.In the previous lesson, we saw that the repressilator, a three-element motif, can exhibit oscillations even in a noisy environment of randomly moving particles. The repressilator’s resilience makes us wonder how well it can respond to a disturbance in the concentrations of its particles.A coarse-grained repressilator modelWith our work on Turing patterns in the prologue, tracking the movements of many individual particles led to a slow simulation that did not scale well given more particles or reactions. This observation led us to devise a coarse-grained reaction-diffusion model that was still able to produce Turing patterns. We used a cellular automaton because the concentrations of particles varied in different locations and were diffusing at different rates.We would like to devise a coarse-grained model of the repressilator. However, the particles diffuse at the same rate and are uniform across the simulation, and so there is no need to track concentrations in individual locations. As a result, we will use a simulation that assumes that the particles are well-mixed.For example, say that we are modeling a degradation reaction. If we start with a concentration of 10,000 X particles, then after a single time step, we will simply multiply the number of X particles by (1-k), where k is a parameter related to the rate of the degradation reaction.As for a repression reaction like X + Y  X, we can update the concentration of Y particles by subtracting some factor r times the current concentrations of X and Y particles.We will further discuss the technical details behind a well-mixed reaction-diffusion model in the next module. In the meantime, we would like to see what happens when we make a major disturbance to the concentration of one of the particles in the well-mixed model. Do the particle concentrations resume their oscillations? To build this model of the repressilator, check out the following tutorial.Visit tutorialThe repressilator is robust to disturbanceThe figure below shows plots over time of concentrations of each particle in our well-mixed simulation of the repressilator.  Midway through this simulation, we greatly increase the concentration of Y particles.A plot of particle concentrations in the well-mixed repressilator model over time. Adding a significant number of Y particles to our simulation (the second blue peak) produces little ultimate disturbance to the concentrations of the three particles, which return to normal oscillations within a single cycle.Because of the spike in the concentration of Y, the reaction Y + Z  Y suppresses the concentration of Z for longer than usual, and so the concentration of X is free to increase for longer than normal. As a result, the next peak of X particles is higher than normal.We might hypothesize that this process would continue, with a tall peak in the concentration of Z. However, the peak in the concentration of Z is no taller than normal, and the next peak shows a normal concentration of X. In other words, the system has very quickly absorbed the blow of an increase in concentration of Y and returned to normal within one cycle.Even with a much larger jolt to the concentration of Y, the concentrations of the three particles return to normal oscillations very quickly, as shown below.A larger increase in the concentration of Y particles than in the previous figure does not produce a substantive change in the system.The repressilator is not the only network motif that leads to oscillations of particle concentrations, but robustness to disturbance is a shared feature of all these motifs. Furthermore, the repressilator is not the most robust oscillator that we can build. Researchers have shown that at least five components are typically needed to build a very robust oscillator,1 which may help explain why real oscillators tend to have more than three components.In the next module, we will encounter a much more involved biochemical process, with far more molecules and reactions, that is used by bacteria to cleverly (and robustly) explore their environment. In fact, we will have so many particles and so many reactions that we will need to completely rethink how we formulate our model.In the meantime, check out the exercises below to continue building your understanding of transcription factor networks and network motifs.Visit exercises            Castillo-Hair, S. M., Villota, E. R., &amp; Coronado, A. M. (2015). Design principles for robust oscillatory behavior. Systems and Synthetic Biology, 9(3), 125–133. https://doi.org/10.1007/s11693-015-9178-6 &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Conclusion: Toward Deep Learning",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/conclusion",
        "date"     : "",
        "content"  : "A brief introduction to neural networksYou may be wondering what the state of the art is for WBC image classification. The best known classifier (CITE) uses high-resolution images in addition to a technique called deep learning. You may have seen this term wielded with mystical reverence, but you may very well not know what it means, and so in this module’s conclusion, we will make an aside to explain the basics of this method.Neurons are cells in the nervous system that are electrically charged and that use this charge as a method of communication to other cells. As you are reading this text, huge numbers of neurons are firing in your brain as it processes the visual information that it receives. The basic structure of a neuron is shown in the figure below.		The components of a neuron. Electrical signals are passed down axons and exchanged at terminal boundaries between neurons. Image courtesy: Jennifer Walinga.In 1943, Warren McCulloch, a neuroscientist, and Walter Pitts, a logician, devised a small network modeling a neuron called a McCulloch-Pitts neuron. (CITE) A McCulloch-Pitts neuron has a fixed threshold θ and takes as input n binary variables x1, …, xn, where each of these variables is equal to either 0 or 1. The MP neuron outputs 1 if x1 +  + xn  θ; otherwise, it returns 0. If an MP neuron outputs 1, then we say that it fires.The McCulloch-Pitts neuron with n equal to 2 and θ equal to 2 is shown in the figure below. The only way that this neuron will fire is if both inputs x1 and x2 are equal to 1.		A McCullough-Pitts neuron with n = 2 and θ = 2. The neuron will fire (i.e., output 1) precisely when both input variables are equal to 1; if either is equal to 0, then x1 + x2 will be less than θ and the neuron will not fire (i.e., output 0).In 1957, Frank Rosenblatt generalized the McCulloch-Pitts neuron into a perceptron. A perceptron also has a threshold constant θ and n binary input variables xi, but it also includes a collection of real-valued constant weights wi that are applied to each input variable. That is, the neuron will output 1 (fire) precisely when w1 · x1 + w2 · x2 +  + wn · xn  θ.Note: A McCulloch-Pitts neuron is a perceptron for which all of the wi are equal to 1.For example, consider the perceptron shown in the figure below. We assign the weight wi to the edge connecting input variable xi to the neuron.		A perceptron with two input variables. The perceptron includes a constant threshold and constant weights w1 and w2. The perceptron outputs 1 precisely when the weighted sum w1 · x1 + w2 · x2 is greater than or equal to θ, and it outputs 0 otherwise.We can generalize the perceptron even further. by allowing the input variables xi to have arbitrary real values (often, these inputs are constrained to be between 0 and 1). Then, rather than the neuron rigidly firing when w1 · x1 + w2 · x2 +  + wn · wn is greater than or equal to θ, we pass this weighted sum into a function f called an activation function, so that the neuron’s output is not 1 or 0 but rather f(w1 · x1 + w2 · x2 +  + wn · wn). The figure below shows the most general form of an artificial neuron for two inputs.		A general form of an artificial neuron for two input variables x1 and x2, two fixed weights w1 and w2, and an activation function f. The output of the neuron, rather than being strictly 0 or 1, is f(w1 · x1 + w2 · x2).A commonly used activation function is the logistic function, f(x) = 1/(1 + e-x), shown in the figure below. Note that the output of this function ranges between 0 (when its input is very negative) and 1 (when its input is very positive).		A plot of the logistic function f(x) = 1/(1 + e-x), an increasing function whose values range between 0 and 1.STOP: What is the activation function used by a perceptron?The outputs of mathematical functions can be used as inputs to other functions via function composition. For example, if f(x) = 2x-1, g(x) = ex, and h(x) = cos(x), then h(g(f(x))) = cos(e2x-1). Similarly, the outputs of artificial neurons can be used as inputs to other neurons with (possibly) different weights and activation functions. Linking neurons together in this way produces a neural network such as the one shown in the figure below.INSERT FIGUREWe have a huge amount of freedom when building neural networks, since we can link up neurons however we like, and since the weights of all incoming edges at each neuron are allowed to vary, as well as the activation function used at each neuron. This freedom will mean that the number of possible neural networks is incomprehensibly large, so that neural networks are very powerful, but they can also be difficult to interpret.Framing a classification problem using neural networksThe question is how to use neural networks to solve a classification problem involving real data. Earlier in this module, we discussed converting each data point x into a feature vector (x1, x2, …, xn) representing each of the n feature values of x. As a result, these n variables of the data’s feature vectors become the n input variables of a neural network.The most typical design of a neural network for classification is then to connect all of the input variables into each one of a collection of (possibly many) artificial neurons, called a hidden layer. These neurons may then be connected in some combination to neurons in another layer, which are connected to neurons in another layer, and so on. As a result, many practical neural networks may have several hidden layers amounting to thousands of neurons, each with their own input weights and possibly different activation functions. The most common usage of the term deep learning refers to solving problems using many hidden layers of neurons; the discussion of the many pitfalls in designing neural networks for deep learning would cover an entire course much longer than this one.The remaining question is what the output of our neural network should be. If we are classifying our data into k classes, then we typically would like to connect the final hidden layer of neurons to k output neurons. Ideally, if we plug in the values of the feature vector for a given data point x that we know belongs to the i-th class, then we would like for the i-th output neuron to output a value close to 1, and all other output neurons to output a value close to 0.A potential neural network model is illustrated in the figure below for our example of categorizing WBC images into three classes. Because the number of nodes and edges in a real network is enormous, this figure uses gray boxes to indicate layers containing potentially many hidden neurons, as well as dashed edges to represent connecting many (if not all) nodes in one layer to each of the nodes in the next layer.		An illustration of a potential neural network used for WBC image classification. This network assumes that each WBC is represented by n features, which serve as the input variables for the network. A number of hidden layers of additional neurons may be used, with connections between some of the neurons in adjacent layers. A final output layer of three neurons corresponds to each of the three WBC classes; our hope is that the weights of the neurons in the network are chosen so that the appropriate neuron outputs a value close to 1 corresponding to an image's class, and that the other two neurons output values close to 0.In order for such a neural network to classify objects in our dataset, we must find the “best” choices for the weights assigned to input variables at each neuron, assuming that we have decided on which activation function(s) to use for the network’s neurons. A network may have hundreds of thousands of these input weights, and for a given application it will be initially unclear which weights to use to produce the magic output in which an object belonging to the i-th class produces an output close to 1 for the i-th output neuron (and an output close to 0) for other neurons.Finding the best choice of parametersIn a previous lesson (CITE), we discussed how to assess the results of a classification algorithm like k-NN on a collection of data with known classes. To this end, we established a subset of our data called the validation set and asked how well the classification algorithm performed on these data, pretending that we did not know the correct class for each data point.To generalize this idea to our neural network classifier, we divide our data with known classes into a training set and a test set. We then seek the choice of parameters that “performs the best” on the training set, ignoring the test set for the moment.Of course, we should specify what it means for a neural network with established parameters to perform well on the training set. Each data point x in the training set has a ground truth classification vector c(x) = (c1, c2, …, ck), where exactly one of the ci is equal to 1 corresponding to the correct class of x, and the other ci are equal to 0. The data point x also has an output vector o(x) = (o1, o2, …, ok), where oi is the output of the i-th output vector in the network when x is used as the input data.Fortunately, we have been using a method of comparing two vectors throughout this course, namely RMSD. The RMSD between an object’s classification vector and its output vector for a given neural network measures how well the network classified this data point, with a value close to 0 representing a good fit.Therefore, we can obtain a good measure of how well a neural network performs on a training set by taking the mean RMSD between classification and output vectors for every element in the training set. As a result, we have a clear problem to solve: select the input weights of each neuron in a neural net minimizing this mean RMSD for objects in the training set.NEEDS EXAMPLEOnce we have chosen a collection of weight parameters for our network that perform well on the training set, we then assess how well this choice of parameters performs on the test set that we left out initially. To this end, we can insert the feature vector of each test set object x as input into the network, and then consult the output vector o(x) = (o1, o2, …, ok) for this object. Whichever i maximizes oi for this output vector is the assigned class of x. Now we are ready to apply what we have learned earlier in this module for quantifying the quality of a classifier, using metrics such as accuracy, recall, specificity, and precision, to determine how well the neural network is performing as a classifier.However, because the typical neural network has many parameters, we should be wary of the curse of dimensionality. As many a student completing a deep learning project has learned, it may be very difficult to obtain a network and parameters that perform well on the training set, and even if a model has a low mean RMSD for the training set, it may perform horribly on the test set. (The former situation is called “underfitting” and the latter is called “overfitting”.)All of this discussion, however, assumes that we have already determined our network’s choice of parameters to produce a low mean RMSD for our training set. But how can we find this best choice of parameters in the first place?Exploring a parameter spaceEvery neural network contains a collection of hundreds of thousands or more input weights. We can think of each of these weight parameters as the coordinate of a very long vector in a very high-dimensional space.From the perspective of producing low mean RMSD between output and classification vectors for a collection of training data, the vast majority of the points in this space (i.e., choices of weight parameters) are worthless. Among this vast infinitude, a tiny number of these points will provide a good result on our training set. Even with substantial computational resources, finding one of these points is daunting.The situation in which we find ourselves is remarkably similar to that of a bacterium exploring an environment with sparse food resources. Or identifying the conformation of a protein having the lowest potential energy.Our idea, then, is to borrow what we have learned in this course, and apply a local search algorithm to explore the parameter space. As with ab initio structure prediction, we could make a variety of small changes to the parameter values, and then update our current parameters to the ones producing the greatest decrease in mean RMSD between output and classification vectors. We continue this process until this mean RMSD stops decreasing. This idea serves as the foundation for the most popular approach for determining parameters for a neural network, called gradient descent.STOP: How can we avoid getting stuck in a local minimum? What does a local minimum mean in the context of parameter search?To avoid getting stuck in a local minimum, we then run this local search algorithm for many different randomly chosen initial parameters. In the end, we take the choice of parameters minimizing mean RMSD over all of the different runs of gradient descent.Many modifications of gradient descent have been developed, but all are based on the core idea of local search. Whether we are classifying cellular images, predicting the structure of a protein, or modeling the exploration of a humble bacterium, local search can prove powerful. And yet despite these three problems demonstrating how biological problems can be solved with modeling, biology on the whole largely remains an untouched universe just waiting to be explored.Note: If you find yourself interested in deep learning and would like to learn more, check out the excellent Neural Networks and Deep Learning online book by Michael Nielsen.Thank you!If you are reading this, and you’ve made it through our entire course, thank you for joining us on this journey! We are grateful that you gave your time to us, and we wish you the best on your educational journey. Please don’t hesitate to contact us if you have any questions, feedback, or would like to leave us a testimonial; we would love to hear from you."
     
   } ,
  
   {
     
        "title"    : "Conclusion: Turing Patterns are Fine-Tuned",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/conclusion",
        "date"     : "",
        "content"  : "In both the particle-based and automaton model for Turing patterns, we observed that the model is fine-tuned, meaning that very slight changes in parameter values can lead to significant changes in the system. These changes could convert spots to stripes, or they could influence how clearly defined the boundaries of the Turing patterns are.The figure below shows how the Turing patterns produced by the Gray-Scott model change as the kill and feed rates vary. The square at position (x, y) shows the pattern obtained as the result of a Gray-Scott simulation with kill rate x and feed rate y. Notice how much the patterns change! You may like to tweak the parameters of the Gray-Scott simulation from the previous lesson to see if you can reproduce these differing patterns.Changing kill (x-axis) and feed (y-axis) parameters greatly affects the Turing patterns obtained in the Gray-Scott model. Each small square shows the patterns obtained from a given choice of feed and kill rate.  Note that many choices of parameters do not produce Turing patterns, which only result from a narrow “sweet spot” band of parameter choices. Image courtesy: Robert Munafo.1Later in this course, we will see an example of a biological system that is the opposite of fine-tuned. In a robust system, perturbations such as variations in parameters do not lead to substantive changes in the ultimate behavior of the system.  Robustness is vital for processes, like your heartbeat, that must be resilient to small environmental changes.It turns out that although Turing’s work offers a compelling argument for how zebras might have gotten their stripes, the exact mechanism causing these stripes to form is still an unresolved question. However, researchers have shown that the skin of zebrafish does exhibit Turing patterns because two types of pigment cells serve as “particles” following a reaction-diffusion model much like the one we presented in this prologue.2Finally, take a look at the following two photos of giant pufferfish.34 Genetically, these fish are practically identical, but their skin patterns are very different. What may seem like a drastic change from spots to stripes is likely attributable to a small change of parameters in a fine-tuned biological system that, like much of life, is powered by randomness.                                                                        Two similar pufferfish with very different skin patterns. (Left) A juvenile Mbu pufferfish with a very familiar pattern. (Right) An adult Mbu pufferfish exhibiting another familiar pattern.  A final noteThank you for making it this far! We hope that you are enjoying the course. You can visit the exercises for this module or skip ahead to the next module by clicking on the appropriate button below. We also ask that you complete the course survey if you have not done so already.Visit exercisesNext module            “Reaction-Diffusion by the Gray-Scott Model: Pearson’s Parametrization” © 1996-2020 Robert P. Munafo https://mrob.com/pub/comp/xmorphia/index.html &#8617;              Nakamasu, A., Takahashi, G., Kanbe, A., &amp; Kondo, S. (2009). Interactions between zebrafish pigment cells responsible for the generation of Turing patterns. Proceedings of the National Academy of Sciences of the United States of America, 106(21), 8429–8434. https://doi.org/10.1073/pnas.0808622106 &#8617;              NSG Coghlan, 2006 Creative Commons Attribution-Share Alike 3.0 Unported &#8617;              Chiswick Chap, 20 February 2012, Creative Commons Attribution-Share Alike 3.0 Unported &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Conclusion: The Beauty of *E. coli*&#39;s Robust Randomized Exploration Algorithm",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/conclusion",
        "date"     : "",
        "content"  : "Two randomized exploration strategiesIn the prologue, we saw that a particle taking a collection of n unit steps in random directions will wind up on average a distance proportional to \(\sqrt{n}\) units away from its starting position. We now will compare such a random walk against a modified algorithm that emulates the  behavior of E. coli by changing the length of a step (i.e., how long the bacterium tumbles) based on the relative change in background attractant concentration.We will represent a bacterium as a particle traveling in two-dimensional space. Units of distance will be measured in µm; recall from the introduction that a bacterium can cover 20 µm in a second during an uninterrupted run. The bacterium will start at the origin (0, 0).We will use L(x, y) to denote the ligand concentration at (x, y) and establish a point (called the goal) at which L(x, y) is maximized. We will place the goal at (1500, 1500), so that the bacterium must travel a significant distance from the origin to reach the goal.We would like the ligand concentration L(x, y) to decrease exponentially the farther we travel from the goal. We therefore set L(x, y) = 100 · 106 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal, which in this case is 1500√2  2121 µm. At the origin, the attractant concentration is equal to 100, and at the goal, the attractant concentration is equal to 100,000,000.STOP: How can we quantify how well a bacterium has done at finding the attractant?We are comparing two different cellular behaviors, and so in the spirit of Module 1, we will simulate many random walks of a particle following each of the two strategies, described in what follows. (The total time needed by our simulation should be large enough to allow the bacterium to have enough time to reach the goal.) For each strategy, we will then measure how far on average a bacterium with each strategy is from the goal at the end of the simulation.Strategy 1: Standard random walkTo model a particle following an “unintelligent” random walk strategy, we first select a random direction of movement along with a duration of tumble. The angle of reorientation is a random number selected uniformly between  and 360°. The duration of each tumble is a “wait time” of sorts and follows an exponential distribution with experimentally verified mean equal to 0.1 seconds1. As the result of a tumble, the cell only changes its orientation, not its position.We then select a random duration to run and let the bacterium run in that direction for the specified amount of time. The duration of each run follows an exponential distribution with mean equal to the experimentally verified value of 1 second.We then iterate the two steps of tumbling and running until the total time allocated for the simulation has elapsed.In the following tutorial, we simulate this naive strategy using a Jupyter notebook that will also help us visualize the results of the simulation.Visit tutorialStrategy 2: Chemotactic random walkIn our second strategy, we mimic the real response of E. coli to its environment based on what we have learned about chemotaxis throughout this module. The simulated bacterium will still follow a run and tumble model, but the duration of each run, which is a function of its tumbling frequency, will depend on the relative change in attractant concentration that it detects.To ensure a mathematically controlled comparison, we will use the same approach for sampling the duration of a tumble and the direction of a run as in the first strategy.We have seen in this module that it takes E. coli about half a second to respond to a change in attractant concentration. We use tresponse to denote this “response time”; to produce a reasonable model of chemotaxis, we will check the attractant concentration of a running particle at the particle’s current location every tresponse seconds.We will then measure the percentage difference between the attractant concentration L(x, y) at the cell’s current point and the attractant concentration at the cell’s previous point, tresponse in the past; we denote this difference as Δ[L]. If Δ[L] is equal to zero, then the probability of a tumble in the next tresponse seconds should be the same as the likelihood of a tumble in the first strategy over the same time period. If Δ[L] is positive, then the probability of a tumble should be greater than it was in strategy 1; if Δ[L] is negative, then the probability of a tumble should be less than it was in strategy 1.To model the relationship between the likelihood of a tumble and the value of Δ[L], we will let t0 denote the mean background run duration, which in the first strategy was equal to one second. We would like to use a simple formula for the expected run duration like t0 * (1 + 10 · Δ[L]).Unfortunately, there are two issues with this formula. First, if Δ[L] is less than -0.1, then the run duration could be negative. Second, if Δ[L] is large, then the bacterium will run for so long that it could reach the goal and run past it.To fix the first issue, we will first take the maximum of t0 * (1 + 10 · Δ[L]) and some small positive number c (we will use c equal to 0.000001). As for the second issue, we will then take the minimum of this expression and 4 · t0. This resulting value,min(max(t0 * (1 + 10 · Δ[L]), c), 4 · t0),becomes the mean run duration of a bacterium based on the recent relative change in concentration.STOP: What is the mean run duration when Δ[L] is equal to zero? Is this what we would hope?As with the first strategy, our simulated cell will alternate between tumbling and running in a random direction until the total time devoted to the simulation has elapsed. The only difference in the second strategy is that we will measure the percentage change in concentration Δ[L] between a cell’s current point and its previous point every tresponse seconds. After determining a mean run time according to the expression above, we will sample a random number p from an exponential distribution with this mean run time, and the cell will tumble after p seconds if p is smaller than tresponse.In the following tutorial, we will adapt the Jupyter notebook that we built in the previous tutorial to simulate this second strategy and run it many times, taking the average of the simulated bacteria’s distance to the goal.Visit tutorialComparing the effectiveness of our two random walk strategiesThe following figure visualizes the trajectories of three cells over 500 seconds using strategy 1 (left) and strategy 2 (right) with a default tumbling frequency t0 of one second. Unlike the cells following strategy 1, the cells following strategy 2 quickly hone in on the goal and remain near it.Three sample trajectories for the standard random walk strategy (left) and chemotactic random walk strategy (right). The standard random walk strategy is shown on the left, and the chemotactic random walk is shown on the right. Redder regions correspond to higher concentrations of ligand, with a goal having maximum concentration at the point (1500, 1500), which is indicated with a blue square. Each particle’s walk is colored from darker to lighter colors across the time frame of its trajectory.Of course, we should be wary of such a small sample size. To confirm that what we observed in these trajectories is true in general, we will compare the two strategies over many simulations. The following figure visualizes the particle’s average distance to the goal over 500 simulations for both strategies and confirms our previous observation that strategy 2 is effective at guiding the simulated particle to the goal. And yet this strategy is driven by random choices of direction of travel, so why would it be so successful?Distance to the goal plotted over time for 500 simulated particles following the standard random walk (pink) and the chemotactic random walk (green). The dark lines indicate the average distance over all simulations, and the shaded area around each line represents one standard deviation from the average.The chemotactic strategy works because it uses a  “rubber band” effect. If the bacterium is traveling down an attractant gradient (i.e., away from an attractant), then it is not allowed to travel very far in a single step before it is forced to tumble. If an increase of attractant is detected, however, then the cell can travel farther in a single direction before tumbling. On average, this effect serves to pull the bacterium in the direction of increasing attractant, even though the directions in which it travels are random.A tiny change to a simple, unsuccessful randomized algorithm can therefore produce an elegant approach for exploring an unknown environment. But we left one more question unanswered: why is a default frequency of one tumble per second stable across a wide range of bacteria? To address this question, we will see how changing t0, the default time for a run step in the absence of change in attractant concentration, affects the ability of a simulated bacterium following strategy 2 to reach the goal. You may like to adjust the value of t0 in the previous tutorial yourself, or follow the tutorial below.Visit tutorialWhy is background tumbling frequency constant across bacterial species?The following figures show three trajectories for a few different values of t0 and a simulation that lasts for 800 seconds. First, we set t0 equal to 0.2 seconds and see that the simulated bacteria are not able to walk far enough in a single step to head toward the goal.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 0.2 seconds.If we increase t0 to 5.0 seconds, then cells can run for so long that they may run past the goal without being able to apply the brakes by tumbling.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 5.0 seconds.When we set t0 equal to 1.0, then the figure below shows a “Goldilocks” effect in which the simulated bacterium can run for long enough at a time to head quickly toward the goal, and it tumbles frequently enough to keep it there.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 1.0 seconds.The figure below visualizes average particle distance to the goal over time for 500 particles using a variety of choices of t0. It confirms that tumbling every second by default is “just right” for finding an attractant.Average distance to the goal over time for 500 cells. Each colored line indicates the average distance to the goal over time for a different value of t0; the shaded area represents one standard deviation.Below, we reproduce the video from earlier in this module showing E. coli moving towards a sugar crystal. This video shows that the behavior of real E. coli is similar to that of our simulated bacteria. Bacteria generally move towards the crystal and then remain close to it; some bacteria run by the crystal, but they turn around to move toward the crystal again.      Bacteria are even smarter than we thoughtIf you closely examine the video above, then you may be curious about the way that bacteria turn around and head back toward the attractant. When they reorient, their behavior appears more intelligent than simply walking in a random direction. As is often true in biology, the reality of the system that we are studying turns out to be more complex than we might at first imagine.The direction of bacterial reorientation is not completely random, but rather follows a normal distribution with mean of 68° and standard deviation of 36°2. That is, the bacterium typically does not tend to make as drastic of a change to its orientation as it would in a pure random walk, which would on average have a change in orientation of 90°.Furthermore, the direction of the bacterium’s reorientation also depends on whether the cell is traveling in the correct direction.3 If the bacterium is moving up an attractant gradient, then it makes smaller changes in its reorientation angle, a feature that helps the cell continue moving straight if it is traveling in the direction of an attractant.We are fortunate to have this wealth of research on chemotaxis in E. coli, which may be the single most studied biological system from the perspective of demonstrating how chemical reactions produce emergent behavior. However, for the study of most biological systems, finding a clear thread connecting a reductionist view of the system to that system’s holistic behavior remains a dream. (For example: how can your thoughts while reading this parenthetical aside be distilled into the firings of individual neurons?)  Regardless of what the future holds, we can be confident that uncovering the underlying mechanisms of biological systems will continue to inspire the work of biological modelers for many years.Visit exercises            Saragosti J., Siberzan P., Buguin A. 2012. Modeling E. coli tumbles by rotational diffusion. Implications for chemotaxis. PLoS One 7(4):e35412. available online. &#8617;              Berg HC, Brown DA. 1972. Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature. Available online &#8617;              Saragosti J, Calvez V, Bournaveas, N, Perthame B, Buguin A, Silberzan P. 2011. Directional persistence of chemotactic bacteria in a traveling concentration wave. PNAS. Available online &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Part 1 Conclusion: Protein Structure Prediction is Solved! (Kind of…)",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/conclusion_part_1",
        "date"     : "",
        "content"  : "In 1967, the Soviets founded an entire research insitute dedicated to protein research and decoding nature’s magic protein folding algorithm. Most of its founding scientists are dead now, but the institute carries on.Protein structure prediction is an old problem, but continual algorithmic improvements and increasing computational resources have led biologists around the world to wish for the day when they could consider protein structure prediction to be solved.That day has come. Kind of.Every two years since 1994, a global effort called Critical Assessment of protein Structure Prediction (CASP) has allowed modelers to test their protein structure prediction algorithms against each other. The contest organizers compile a (secret) collection of experimentally verified protein structures and then run all submitted algorithms against these proteins.The 14th iteration of this contest, held in 2020, was won in a landslide. The second version of AlphaFold,1 one of the projects of DeepMind (an Alphabet subsidiary), vastly outperformed the world’s foremost structure prediction approaches, including those that we discussed in this module.The algorithm powering AlphaFold is an extremely involved method based on deep learning, a topic that we will discuss in this work’s final module. If you’re interested in learning more about this algorithm, consult the AlphaFold website or this excellent blog post by Mohammed al Quraishi: https://bit.ly/39Mnym3.Instead of using RMSD, CASP scores a predicted structure against a known structure using the global distance test (GDT). For some threshold t, we first take the percentage of alpha carbon positions for which the distance between corresponding alpha carbons in the two structures is at most t. The GDT score that CASP uses then averages the percentages obtained when t is equal to each of 1, 2, 4, and 8 angstroms. A GDT score of 90% is considered good, and a score of 95% is considered excellent (i.e., comparable to minor errors resulting from experimentation) 2.We will show a few plots to illustrate the decisiveness of AlphaFold’s CASP victory. The first graph, which is shown in the figure below, compares the scores of AlphaFold against the second-place algorithm (a product of David Baker’s laboratory, which developed the Robetta and Rosetta@Home software that we used in this module).A plot of GDT scores for AlphaFold2 (shown in blue) and Baker lab (shown in orange) submissions over all proteins in the CASP14 contest. AlphaFold2 finished first in CASP14, and Baker lab finished second. Image courtesy: Mohammed al Quraishi.We can appreciate the margin of victory in the above figure if we compare it against the margin between the second- and third-place competitors, shown in the figure below.A plot of GDT scores for the Baker lab (shown in blue) and Zhang lab (shown in orange) submissions for all proteins in the CASP14 contest. Baker lab finished second in CASP14, and Zhang lab finished third. Image courtesy: Mohammed al Quraishi.For each protein in the CASP14 contest, we can also compute each algorithm’s z-score, defined as the number of standard deviations that the algorithm’s GDT score falls from the mean GDT score over all competitors. For example, a z-score of 1.4 would imply that the approach performed 1.4 standard deviations above the mean, and a z-score of -0.9 would imply that the approach performed 0.9 standard deviations below the mean.By summing all of an algorithm’s positive z-scores, we obtain a reasonable metric for the relative quality of an algorithm compared to its competitors. If an algorithm’s sum of z-scores is large, then the algorithm racked up lots of positive z-scores, and we can conclude that it performed well. The figure below shows the sum of z-scores for all CASP14 participants and reiterates the margin of AlphaFold’s victory.A bar chart plotting the sum of z-scores for every entrant in the CASP14 contest. AlphaFold2 is shown on the far left; its sum of z-scores is over double that of the second-place submission. Source: https://predictioncenter.org/casp14/zscores_final.cgi.AlphaFold’s CASP14 victory led some scientists  and media outlets  to declare that protein structure