Simple-Jekyll-Search


A JavaScript library to add search functionality to any Jekyll blog.

Use case

You have a blog, built with Jekyll, and want lightweight search functionality on your blog, purely client-side?

No server configurations or databases to maintain.

Just 5 minutes to have a fully working searchable blog.


Installation

npm

npm install simple-jekyll-search

Getting started

Create search.json

Place the following code in a file called search.json in the root of your Jekyll blog. (You can also get a copy from example/search.json in the repository)

This file will be used as a small data source to perform the searches on the client side:

---
layout: none
---
[
  {% for post in site.posts %}
    {
      "title"    : "{{ post.title | escape }}",
      "category" : "{{ post.category }}",
      "tags"     : "{{ post.tags | join: ', ' }}",
      "url"      : "{{ site.baseurl }}{{ post.url }}",
      "date"     : "{{ post.date }}"
    } {% unless forloop.last %},{% endunless %}
  {% endfor %}
]

Preparing the plugin

Add DOM elements

SimpleJekyllSearch needs two DOM elements to work:

- a search input field
- a result container to display the results

Give me the code

Here is the code you can use with the default configuration:

You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)

For example in _layouts/default.html:

<!-- HTML elements for search -->
<input type="text" id="search-input" placeholder="Search blog posts..">
<ul id="results-container"></ul>

<!-- or without installing anything -->
<script src="https://unpkg.com/simple-jekyll-search@latest/dest/simple-jekyll-search.min.js"></script>

Usage

Customize SimpleJekyllSearch by passing in your configuration options:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json'
})

returns { search }

A new instance of SimpleJekyllSearch returns an object whose only property is search.

search is a function used to simulate a user input and display the matching results. 

E.g.:

var sjs = SimpleJekyllSearch({ ...options })
sjs.search('Hello')

💡 it can be used to filter posts by tags or categories!
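For example, a minimal sketch of wiring tag links to the returned search function. It assumes sjs is the instance created above; the tag-link class and data-tag attribute are illustrative, not part of the library:

// Hypothetical markup: <a class="tag-link" data-tag="jekyll">jekyll</a>
document.querySelectorAll('.tag-link').forEach(function (link) {
  link.addEventListener('click', function (event) {
    event.preventDefault()
    // Run a search for the clicked tag, as if the user had typed it
    sjs.search(link.dataset.tag)
  })
})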

Options

Here is a list of the available options, usage questions, troubleshooting & guides.

searchInput (Element) [required]

The input element on which the plugin listens for keyboard events to trigger the searching and rendering of matching articles.

resultsContainer (Element) [required]

The container element in which the search results should be rendered. Typically a <ul>.

json (String|JSON) [required]

You can either pass in a URL to the search.json file, or pass the results directly as JSON to save one round trip for the data.
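For instance, a sketch of passing the data directly (the inline array is illustrative):

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  // Passing parsed JSON directly skips the request for /search.json
  json: [
    { "title": "Welcome to Jekyll!", "url": "/jekyll/update/2014/11/01/welcome-to-jekyll.html" }
  ]
})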

searchResultTemplate (String) [optional]

The template of a single rendered search result.

The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.

E.g.

The template

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  searchResultTemplate: '<li><a href="{{ site.url }}{url}">{title}</a></li>'
})

will render to the following

<li><a href="/jekyll/update/2014/11/01/welcome-to-jekyll.html">Welcome to Jekyll!</a></li>

If the search.json contains this data

[
    {
      "title"    : "Welcome to Jekyll!",
      "category" : "",
      "tags"     : "",
      "url"      : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",
      "date"     : "2014-11-01 21:07:22 +0100"
    }
]
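Any property present in the JSON entries can be referenced the same way; for instance, a purely illustrative template that also renders the date:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  // {title}, {url} and {date} come straight from the fields in search.json
  searchResultTemplate: '<li><a href="{url}">{title}</a> <small>{date}</small></li>'
})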

templateMiddleware (Function) [optional]

A function that will be called whenever a match in the template is found.

It gets passed the current property name, property value, and the template.

If the function returns a value other than undefined, that value replaces the matched property in the template.

This can be potentially useful for manipulating URLs etc.

Example:

SimpleJekyllSearch({
  ...
  templateMiddleware: function(prop, value, template) {
    if (prop === 'bar') {
      return value.replace(/^\//, '')
    }
  }
  ...
})

See the tests (https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example

sortMiddleware (Function) [optional]

A function that will be used to sort the filtered results.

It can be used, for example, to group the results by section.

Example:

SimpleJekyllSearch({
  ...
  sortMiddleware: function(a, b) {
    var astr = String(a.section) + "-" + String(a.caption);
    var bstr = String(b.section) + "-" + String(b.caption);
    return astr.localeCompare(bstr)
  }
  ...
})

noResultsText (String) [optional]

The HTML that will be shown if the query didn’t match anything.
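For example (the message itself is illustrative):

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  // Rendered into the results container when nothing matches
  noResultsText: '<li>No results found</li>'
})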

limit (Number) [optional]

You can limit the number of posts rendered on the page.

fuzzy (Boolean) [optional]

Enable fuzzy search to allow less restrictive matching.

exclude (Array) [optional]

Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs and words are allowed).

success (Function) [optional]

A function called once the data has been loaded.
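A sketch of one common use: running an initial query taken from the URL once the data is ready. The q parameter name is an assumption, and relying on the outer sjs variable assumes the JSON is fetched asynchronously, so sjs is assigned before the callback fires:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  success: function () {
    // e.g. /search/?q=jekyll pre-fills the input and runs the search
    var query = new URLSearchParams(window.location.search).get('q')
    if (query) {
      document.getElementById('search-input').value = query
      sjs.search(query)
    }
  }
})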

debounceTime (Number) [optional]

Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no debounceTime (in milliseconds) is provided, a search is triggered on each keystroke.
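For example, to wait 300 ms after the last keystroke before searching:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  debounceTime: 300
})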


If search isn’t working due to invalid JSON

There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. To use it, add the simple_search_filter.rb file to your _plugins folder, and use remove_chars as a filter.

For example: in search.json, replace

"content": "{{ page.content | strip_html | strip_newlines }}"

with

"content": "{{ page.content | strip_html | strip_newlines | remove_chars | escape }}"

If this doesn’t work when using GitHub Pages you can try jsonify to make sure the content is JSON compatible:

"content": {{ page.content | jsonify }}

Note: you don’t need to use quotes " in this since jsonify automatically inserts them.

Enabling full-text search

Replace search.json with the following code:

---
layout: none
---
[
  
  ,
  
   {
     
        "title"    : "Page Not Found",
        "category" : "",
        "tags"     : "",
        "url"      : "/404.html",
        "date"     : "",
        "content"  : "Sorry, but the page you were trying to view does not exist  please return to our course homepage."
     
   } ,
  
   {
     
        "title"    : "VMD Tutorial",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/VMDTutorial",
        "date"     : "",
        "content"  : "This is a short tutorial on how to use VMD to visualize molecules and perform some basic analysis. Before you start, make sure to have downloaded and installed VMD.Loading MoleculesThese steps will be on how to load molecules into VMD. We will use the example of 6vw1.Download the protein structure of 6vw1 from the protein data bank.Next we can launch VMD and load the molecule into the program. In VMD Main, navigate to File &gt; New Molecule. Click Browse, select the molecule (6vw1.pdb) and click Load.The molecule should now be listed in VMD Main as well as the visualization in the OpenGL Display.Section to be movedGlycansFor VMD, there is no specific keyword to select glycans. A workaround is to use the keywords: “not protein and not water”. To recreate the basic VMD visualizations from the module of the open-state (6vyb) of SARS-CoV-2 Spike, use the following representations. (For the protein chains, use Glass3 for Material).The end result should look like this: Visualization Exercise Try to recreate the visualization of Hotspot31 for SARS-CoV-2 (same molecule as the tutorial). The important residues and their corresponding colors are listed on the left. "
     
   } ,
  
   {
     
        "title"    : "Ab initio Protein Structure Prediction",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/ab_initio",
        "date"     : "",
        "content"  : "Modeling ab initio structure prediction as an exploration problemPredicting a protein’s structure using only its amino acid sequence is called ab initio structure prediction (ab initio means “from the beginning” in Latin). Although many different algorithms have been developed for ab initio protein structure through the years, these algorithms all find themselves solving a similar problem.Biochemical research has contributed to the development of scoring functions called force fields that use the physicochemical properties of amino acids introduced in the previous lesson to compute the potential energy of a candidate protein shape. For a given choice of force field, we can think of ab initio structure prediction as solving the following problem: given a primary structure of a polypeptide, find its tertiary structure having minimum energy. This problem exemplifies an optimization problem, in which we are seeking an object maximizing or minimizing some function subject to constraints.The formulation of protein structure prediction as an optimization problem may not strike you as similar to anything that we have done before in this course. However, consider once more a bacterium exploring an environment for food. Every point in the bacterium’s “search space” is characterized by a concentration of attractant, and the bacterium’s goal is to reach the point of maximum attractant concentration.In the case of structure prediction, our search space is the collection of all possible conformations of a given protein, and each point in this search space represents a single conformation with an associated potential energy.   Just as we imagined a ball rolling down a hill to find lower energy, we can now imagine exploring the search space of all conformations of a polypeptide to find the conformation having lowest energy. The general problem of exploring a search space to find a point minimizing some function is illustrated in the figure below, in which the height of each point represents the value of the function at that point, and our goal is to find the lowest point in the space.Optimization problems can be thought of as exploring a landscape, in which the height of a point is the value of the function that we wish to optimize. Finding the highest or lowest point in this landscape corresponds to maximizing or minimizing the function over the search space. Image courtesy: David Beamish.A local search algorithm for ab initio structure predictionNow that we have conceptualized the protein structure prediction problem as exploring a search space, we will develop an algorithm to explore this space. Our idea is to use an approach similar to E. coli’s clever exploration algorithm from a previous module: over a sequence of steps, we will consult a collection of nearby points in the space, and then move in the “direction” in which the energy function decreases the most. This approach belongs to a broad category of optimization algorithms called local search algorithms.Adapting a local search algorithm to protein structure prediction requires us to develop a notion of what it means to consider the points “nearby” a given conformation in a protein search space. Many ab initio algorithms start at an arbitrary initial conformation and then make a variety of minor modifications to that structure (i.e., nearby points in the space), updating the current conformation to the modification that produces the greatest decrease in free energy. 
These algorithms then iterate the process of progressively altering the protein structure to have the greatest decrease in potential energy. They terminate the search after reaching a structure for which no changes to the structure further reduce this energy.Yet returning to the chemotaxis analogy, imagine what happens if we were to place many small sugar cubes and one large sugar cube into the bacterium’s environment. The bacterium will sense the gradient not of the large sugar cube but of its nearest attractant. Because the smaller food sources outnumber the larger food source, the bacterium will likely not reach the point of greatest attractant concentration. In bacterial exploration, this is a feature, not a bug; if the bacterium exhausts one food source, then it will just move to another. But in protein structure prediction, we should be wary of a local search algorithm returning a protein structure that does not have minimum free energy but that does have the property that no “nearby” structures have lower energy.In general, an object in a search space that has a smaller value of the optimization function than neighboring points is called a local minimum. Returning to our landscape analogy, our search space may have many valleys, but in an optimization problem, we are seeking the lowest valley over the entire landscape, called a global minimum.STOP: How could we improve our local search algorithm for structure prediction to avoid winding up in a local minimum?Researchers applying local search algorithms have devised a number of ways to avoid local minima, two of which are so fundamental that we should mention them. First, because the algorithm’s choice of initial conformation has a huge influence on the final conformation, we should run the algorithm multiple times with different starting conformations. This is analogous to allowing multiple bacteria to explore their environment at different starting points. Second, every time we reach a local minimum, we could allow ourselves to change the structure with some probability, thus giving our local search algorithm the chance to “bounce” out of a local minimum. Once again, randomized algorithms help us solve problems!Applying an ab initio algorithm to a protein sequenceTo run an ab initio structure prediction algorithm on a real protein, we will use a software resource called QUARK, which is built upon the ideas discussed in the previous section, with some added features. For example, QUARK’s algorithm applies a combination of multiple scoring functions to look for the lowest energy conformation across all of these functions.Levinthal’s paradox means that the search space of all possible structures for a protein is so large that accurately predicting large protein structures with ab initio modeling remains very difficult. As such, QUARK limits us to proteins with at most 200 amino acids, and so we will run it only on human hemoglobin subunit alpha.Visit tutorialToward a faster approach for protein structure predictionThe figure below shows the top five predicted human hemoglobin subunit alpha structures returned by QUARK as well as the protein’s experimentally verified structure, and an average of these six structures. It takes a keen eye to see any differences between these structures. 
We conclude that ab initio prediction can be accurate.The experimentally verified protein structure of human hemoglobin subunit alpha (top left) along with five models of this protein produced by QUARK from the protein’s primary sequence, all of which are nearly indistinguishable from the verified structure with the naked eye.Yet we also wonder if we can speed up our structure prediction algorithms so that they will scale to a larger protein like the SARS-CoV-2 spike protein. In the next lesson, we will learn about another type of protein structure prediction that uses a database of known structures.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Protein Structure Comparison",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/accuracy",
        "date"     : "",
        "content"  : "In this lesson, we will compare the results of the SARS-CoV-2 spike protein prediction from the previous lesson against each other and against the protein’s empirically validated structure. To do so, we need a method of comparing two structures.Comparing two shapes with the Kabsch algorithmComparing two protein structures is intrinsically similar to comparing two shapes, such as those shown in the figure below.STOP: Consider the two shapes in the figure below. How similar are they?If you think you have a good handle on comparing the above two shapes, then it is because humans have very highly evolved eyes and brains. As we will see in the next module, training a computer to detect and differentiate objects is more difficult than you think!We would like to develop a distance function d(S, T) quantifying how different two shapes S and T are. If S and T are the same, then d(S, T) should be equal to zero; the more different S and T become, the larger d should become.You may have noticed that the two shapes in the preceding figure are, in fact, identical. To demonstrate that this is true, we can first move the red shape to superimpose it over the blue shape, then flip the red shape, and finally rotate it so that its boundary coincides with the blue shape, as shown in the animation below. In general, if a shape S can be translated, flipped, and/or rotated to produce shape T, then S and T are the same shape, and so d(S, T) should be equal to zero. The question is what d(S, T) should be if S and T are not the same shape.We can transform the red shape into the blue shape by translating it, flipping it, and then rotating it.Our idea for defining d(S, T), then, is first to translate, flip, and rotate S so that it resembles T “as much as possible” to give us a fair comparison. Once we have done so, we should devise a metric to quantify the difference between the two shapes that will represent d(S, T).We first translate S to have the same center of mass (or center of mass) as T. The center of mass of S is found at the point (xS, yS) such that xS and yS are the respective averages of the x-coordinates and y-coordinates on the boundary of S. The center of mass of some shapes can be determined mathematically. But for irregular shapes, we will first sample n points from the boundary of S and then estimate xS and yS as the average of all the respective x- and y-coordinates from the sampled points.Next, imagine that we have found the desired rotation and flip of S that makes it resemble T as much as possible; we are now ready to define d(S, T) in the following way. We sample n points along the boundary of each shape, converting S and T into vectors s = (s1, …, sn) and t = (t1, …, tn), where si is the i-th point on the boundary of S. The root mean square deviation (RMSD) between the two vectors is the square root of the average squared distance between corresponding points,\[\text{RMSD}(\mathbf{s}, \mathbf{t}) = \sqrt{\dfrac{1}{n} \cdot [d(s_1, t_1)^2 + d(s_2, t_2)^2 + \cdots + d(s_n, t_n)^2]} \,.\]In this formula, d(si, ti) is the distance between the points si and ti.Note: RMSD is a common approach across data science when measuring the differences between two vectors.For an example two-dimensional RMSD calculation, consider the figure below, which shows two shapes with four points sampled from each. 
(For simplicity, the shapes do not have the same center of mass.)Two shapes with four points sampled from each.The distances between corresponding points in this figure are equal to \(\sqrt{2}\), 1, 4, and \(\sqrt{2}\). As a result, we compute the RMSD as\[\begin{align*}\text{RMSD}(\mathbf{s}, \mathbf{t}) &amp; = \sqrt{\dfrac{1}{4} \cdot (\sqrt{2}^2 + 1^2 + 2^2 + \sqrt{2}^2)} \\&amp; = \sqrt{\dfrac{1}{4} \cdot 9}\\&amp; = \sqrt{\dfrac{9}{4}}\\&amp; = \dfrac{3}{2}\end{align*}\]STOP: Do you see any issues with using RMSD to compare two shapes?Even if we assume that two shapes have already been overlapped and rotated appropriately, we still should ensure that we sample enough points to give a good approximation of how different the shapes are. Consider a circle inscribed within a square, as shown in the figure below. If we happened to sample only the four points indicated in this figure, then we would sample the same points for each shape and conclude that the RMSD between these two shapes is equal to zero. Fortunately, this issue is easily resolved by making sure to sample enough points from the shape boundaries.A circle inscribed within a square. Sampling of only the four points where the shapes intersect will give an RMSD of zero, a flawed estimate for the distance between the two shapes.However, we have still assumed that we already rotated (and possibly flipped) S to be as “similar” to T as possible. In practice, after superimposing S and T to have the same center of mass, we would like to find the flip and/or rotation of S that minimizes the RMSD between our vectorizations of S and T over all possible ways of flipping and rotating S. It is this minimum RMSD that we define as d(S, T).The best way of rotating and flipping S so as to minimize the RMSD between the resulting vectors s and t can be found with a method called the Kabsch algorithm. Explaining this algorithm requires some advanced linear algebra and is beyond the scope of our work but is described here.PDB format represents a protein’s structureThe Kabsch algorithm offers a compelling way to determine the similarity of two protein structures. We can convert a protein containing n amino acids into a vector of length n by selecting a single representative point from each amino acid. For this representative point, we typically use the alpha carbon, the amino acid’s centrally located carbon atom.Whether a protein structure is experimentally validated or predicted by an algorithm, the structure is often represented in a unified file format used by the PDB called .pdb format. In this format (see the figure below), each atom in the protein is labeled according to several different characteristics, including:  the element from which the atom derives;  the amino acid in which the atom is contained;  the chain on which this amino acid is found;  the position of the amino acid within this chain; and  the 3D coordinates (x, y, z) of the atom in angstroms (10-10 meters).Lines 2,159 to 2,175 of the .pdb file for the experimentally verified SARS-CoV-2 spike protein structure, PDB entry 6vxx. These 17 lines contain information on the atoms taken from two amino acids, alanine and tyrosine. The rows corresponding to these amino acids’ alpha carbons are shown in green and appear as “CA” under the “Element” column. 
Column labels are as follows: “Index” refers to the number of the amino acid; “Element” identifies the chemical element to which this atom corresponds; “Chain” indicates which chain on which the atom is found; “Position” identifies the position in the protein of the amino acid from which the atom is taken; “Coordinates” indicates the x, y, and z coordinates of the atom’s position (in angstroms).Note: The above figure shows just part of the information needed to fully represent a protein structure. For example, a .pdb file will also contain information about the disulfide bonds between amino acids. For more information, consult the official PDB documentation).The Kabsch algorithm can be fooledAlthough the Kabsch algorithm is powerful, we should be careful when applying it. Consider the figure below, which shows two toy protein structures; the orange structure (S) is identical to the blue structure (T) except for a change in a single bond angle between the third and fourth amino acids. And yet this tiny change in the protein’s structure causes a significant increase in d(si, ti) for every i greater than 3, which inflates the RMSD.(Top) Two hypothetical protein structures that differ in only a single bond angle between the third and fourth amino acids, shown in red. Each circle represents an alpha carbon. (Bottom left) Superimposing the first three amino acids shows how much the change in the bond angle throws off the computation of RMSD by increasing the distances between corresponding alpha carbons. (Bottom right) The Kabsch algorithm would align the centers of gravity of the two structures in order to minimize RMSD between corresponding alpha carbons. This alignment belies the similarity in the structures and makes it difficult for the untrained observer to notice the proteins’ similarity.Another way in which the Kabsch algorithm can be tricked is in the case of an appended substructure that throws off the ordering of the amino acids. The following figure shows a toy example of a structure into which we incorporate a loop, thus throwing off the natural order of comparing amino acids. (The same effect is caused if one or more amino acids are deleted from one of the two proteins.)Two toy two protein structures, one of which includes a loop of three amino acids. After the loop, each amino acid in the orange structure will be compared against an amino acid that occurs farther long in the blue structure, thus increasing d(si, ti)2 for each such amino acid.To address this second issue, biologists often first align the sequences of two proteins, discarding any amino acids that do not align before performing a vectorization of structures for the RMSD calculation. We will soon see an example of a protein sequence alignment when comparing the coronavirus spike proteins.In short, if the RMSD of two proteins is large, then we should be wary of concluding that the proteins are very different, and we may need to combine RMSD with other methods of structure comparison. 
But if the RMSD is small (e.g., just a few angstroms), then we can have confidence that the proteins are indeed similar.We are now ready to apply the Kabsch algorithm to compare the structures that we predicted for human hemoglobin subunit alpha and the SARS-CoV-2 spike protein against their respective experimentally validated structures.Visit tutorialAssessing the accuracy of our structure prediction modelsIn the tutorials occurring earlier in this module, we used protein structure prediction software to predict the structure of human hemoglobin subunit alpha (using ab initio modeling) and the SARS-CoV-2 spike protein (using homology modeling). We will now see how well our models performed by showing the values of RMSD produced by the Kabsch algorithm when comparing each of these models against the validated structures.Ab initio (QUARK) models of Human Hemoglobin Subunit AlphaThe table below shows the RMSD between each of the five predicted structures returned by QUARK and the validated structure of human hemoglobin subunit alpha (PDB entry: 1si4). We are tempted to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is so short (141 amino acids), researchers would consider this RMSD score to be high.            Quark Model      RMSD                  QUARK1      1.58              QUARK2      2.0988              QUARK3      3.11              QUARK4      1.9343              QUARK5      2.6495      It is tempting to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is such a short protein (141 amino acids), researchers would consider this RMSD score high.Homology models of the SARS-CoV-2 spike proteinIn the homology tutorial, we used SWISS-MODEL and Robetta to predict the structure of the SARS-CoV-2 spike protein, and we used GalaxyWeb to predict the structure of this protein’s receptor binding domain (RBD).GalaxyWEBFirst, we consider the five GalaxyWEB models produced for the spike protein RBD. The following table shows the RMSD between each of these models and the validated SARS-CoV-2 RBD (PDB entry: 6lzg).            GalaxyWEB      RMSD                  Galaxy1      0.1775              Galaxy2      0.1459              Galaxy3      0.1526              Galaxy4      0.1434              Galaxy5      0.1202      All of these models have an excellent RMSD score and can be considered very accurate. Note that their RMSD is more than an order of magnitude lower than the RMSD computed for our ab initio model of hemoglobin subunit alpha, despite the fact that the RBD is longer (229 amino acids).SWISS-MODELWe now shift to homology models of the entire spike protein and start with SWISS-MODEL. The following table shows the RMSD between each of three structures produced by SWISS-MODEL and the validated structure of the SARS-CoV-2 spike protein (PDB entry: 6vxx).            SWISS MODEL      RMSD                  SWISS1      5.8518              SWISS2      11.3432              SWISS3      11.3432      The first structure has a lowest RMSD over the three models, and even though its RMSD (5.818) is significantly higher than what we saw for the GalaxyWEB prediction for the RBD, keep in mind that the spike protein is 1281 amino acids long, and so the sensitivity of RMSD to slight changes should give us confidence that our models are on the right track.RobettaFinally, we produced five predicted structures of a single chain of the SARS-CoV-2 spike protein using Robetta. 
The following table compares each of them against the validated structure of the SARS-CoV-2 spike protein (PDB: 6vxx).            Robetta      RMSD                  Robetta1      3.1189              Robetta2      3.7568              Robetta3      2.9972              Robetta4      2.5852              Robetta5      12.0975      STOP: Which do you think performed more accurately on our predictions: SWISS-MODEL or Robetta?Distributing protein structure prediction around the worldWhile some researchers were working to elucidate the structure of the SARS-CoV-2 spike protein experimentally, thousands of users were devoting their computers to the cause of predicting the protein’s structure computationally. Two leading structure prediction projects, Rosetta@home and Folding@home, encourage volunteers to download their software and contribute to a gigantic distributed effort to predict protein shape. Even with a modest laptop, a user can donate some of their computer’s idle resources to help predict protein structure.The results of the SSGCID models of the S protein released by Rosetta@Home are shown below.            SSGCID      RMSD (Full Protein)      RMSD (Single Chain)                  SSGCID1      3.505      2.7843              SSGCID2      2.3274      2.107              SSGCID3      2.12      1.866              SSGCID4      2.0854      2.047              SSGCID5      4.9636      4.6443      As we might expect due to having access to thousands of users’ computers, the SSGCID models outperform our SWISS-MODEL models. Yet a typical threshold for whether a predicted structure is accurate is if, when compared to a validated structure, its RMSD is smaller than 2.0 angstroms, a test that the models in the above table do not pass.The inability of even powerful models to obtain an accurate predicted structure for the SARS-CoV-2 spike protein may make it seem that protein structure prediction is a lost cause. Perhaps biochemists should head back to their expensive experimental validations and ignore the musings of computational scientists. In the conclusion to part 1, we will find hope.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Methylation Helps a Bacterium Adapt to Differing Concentrations",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/adaptation",
        "date"     : "",
        "content"  : "Bacterial tumbling frequencies remain constant for different attractant concentrationsIn the previous lesson, we explored the signal transduction pathway by which E. coli can change its tumbling frequency in response to a change in the concentration of an attractant. But the reality of cellular environments is that the concentration of an attractant can vary across several orders of magnitude. The cell therefore needs to detect not absolute concentrations of an attractant but rather relative changes.E. coli detects relative changes in its concentration via adaptation to these changes. If the concentration of attractant remains constant for a period of time, then regardless of the absolute value of the concentration, the cell returns to the same background tumbling frequency. That is, E. coli demonstrates robustness to the attractant concentration in maintaining its default tumbling behavior.However, our current model is not able to address this adaptation. If the ligand concentration increases in the model, then phosphorylated CheY will plummet and remain at a low steady state.In this lesson, we will investigate the biochemical mechanism that E. coli uses to achieve such a robust response to environments with different background concentrations. We will then expand the model we built in the previous lesson to replicate the bacterium’s adaptive response.Bacteria remember past concentrations using methylationRecall that in the absence of an attractant, CheW and CheA readily bind to an MCP, leading to greater autophosphorylation of CheA, which in turn phosphorylates CheY. The greater the concentration of phosphorylated CheY, the more frequently the bacterium tumbles.Signal transduction is achieved through phosphorylation, but E. coli maintains a “memory” of past environmental concentrations through a chemical process called methylation. In this  reaction, a methyl group (-CH3) is added to an organic molecule; the removal of a methyl group is called demethylation.Every MCP receptor contains four methylation sites, meaning that between zero and four methyl groups can be added to the receptor. On the plasma membrane, many MCPs, CheW, and CheA molecules form an array structure. Methylation reduces the negative charge on the receptors, stabilizing the array and facilitating CheA autophosphorylation. The more sites that are methylated, the higher the autophosphorylation rate of CheA, which means that CheY has a higher phosphorylation rate, and tumbling frequency increases.We now have two different ways that tumbling frequency can be elevated. First, if the concentration of an attractant is low, then CheW and CheA freely form a complex with the MCP, and the phosphorylation cascade passes phosphoryl groups to CheY, which interacts with the flagella and keeps tumbling frequency high. Second, an increase in MCP methylation can also boost CheA autophosphorylation and lead to increased tumbling frequency.Methylation of MCPs is achieved by an additional protein called CheR. When bound to MCPs, CheR methylates ligand-bound MCPs faster12, and so the rate of MCP methylation by CheR is higher if the MCP is bound to a ligand.3. Let’s consider how this fact affects a bacterium’s behavior.Say that E. coli encounters an increase in attractant concentration. Then the lack of a phosphorylation cascade will mean less phosphorylated CheY, and so the tumbling frequency will decrease. 
However, if the attractant concentration levels off, then the tumbling frequency will flatten, while CheR starts methylating the MCP. Over time, the rising methylation will increase CheA autophosphorylation, bringing back the phosphorylation cascade and raising tumbling frequency back to default levels. Just as the phosphorylation of CheY can be reversed, MCP methylation can be undone so that it is not permanent. In particular, an enzyme called CheB, which like CheY is phosphorylated by CheA, demethylates MCPs (as well as autodephosphorylates). The rate of an MCP’s demethylation depends on the extent to which the MCP is methylated: the rate of MCP methylation is higher when the MCP is in a low methylation state, and the rate of demethylation is faster when the MCP is in a high methylation state.3 The figure below adds CheR and CheB to provide a complete picture of the core pathways influencing chemotaxis. To model these pathways and see how our simulated bacterial system responds to different relative attractant concentrations, we will need to add quite a few molecules and reactions to our current model. The chemotaxis signal-transduction pathway with methylation included. CheA phosphorylates CheB, which demethylates MCPs, while CheR methylates MCPs. Blue lines denote phosphorylation, grey lines denote dephosphorylation, green arrows denote methylation, and red arrows denote demethylation. Image modified from Parkinson Lab’s illustrations. Combinatorial explosion and the need for rule-based modeling. To expand our model, we will need to include methylation of the MCP by CheR and demethylation of the MCP by CheB. For simplicity, we will use three methylation levels (low, medium, and high) rather than five. Imagine that we were attempting to specify every reaction that could take place in our model. To specify an MCP, we would need to establish whether it is bound to a ligand (two possible states), whether it is bound to CheR (two possible states), whether it is phosphorylated (two possible states), and which methylation state it is in (three possible states). Therefore, a given MCP would need 2 · 2 · 2 · 3 = 24 total states. Consider the simple reaction of a ligand binding to an MCP, which we originally wrote as T + L → TL. We now need this reaction to include 12 of the 24 states, the ones corresponding to the MCP being unbound to the ligand. Our previously simple reaction would become 12 different reactions, one for each possible unbound state of the complex molecule T. And if the situation were just a little more complex, with the ligand molecule L having n possible states, then we would have 12n reactions. Imagine trying to debug a model in which we had accidentally incorporated a typo when transcribing just one of these reactions! In other words, as our model grows, with multiple different states for each molecule involved in each reaction, the number of reactions that we need to represent the system grows rapidly; this phenomenon is called combinatorial explosion and means that building realistic models of biochemical systems at scale can be daunting. Yet all 12 of these reactions can be summarized with a single rule: a ligand and a receptor can bind into a complex if the receptor is unbound. Moreover, all 12 reactions implied by the rule are easily inferable from it (the short sketch after this entry makes the counting concrete). 
This example illustrates rule-based modeling, a paradigm applied by BioNetGen in which a potentially enormous number of reactions are specified by a much smaller collection of “rules” from which all reactions can be inferred. We will not bog down the main text with a full specification of all the rules needed to add methylation to our model while avoiding combinatorial explosion. If you are interested in the details, please follow our tutorial. Visit tutorial. Bacterial tumbling is robust to large sudden changes in attractant concentration. In the figures that follow, we plot the concentration over time of each molecule for different values of l0, the initial concentration of ligand. From what we have learned about E. coli, we should see the concentration of phosphorylated CheY (and therefore the bacterium’s tumbling frequency) drop before returning to its original equilibrium. But will our simulation capture this behavior? First, we add a relatively small amount of attractant, setting l0 equal to 10,000. The system returns so quickly to an equilibrium in phosphorylated CheY that it is difficult to imagine that the attractant has had any effect on tumbling frequency. Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 10,000 initial attractant ligand particles. If instead l0 is equal to 100,000, then we obtain the figure below. After an initial drop in the concentration of phosphorylated CheY, it returns to equilibrium after a few minutes. Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100,000 initial attractant ligand particles. When we increase l0 by another factor of ten to 1 million, the initial drop is more pronounced, but the system returns just as quickly to equilibrium. Note how much higher the concentration of methylated receptors is in this figure compared to the previous figure; however, there is still a significant concentration of receptors with low methylation, indicating that the system may be able to handle an even larger jolt of attractant. Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with one million initial attractant ligand particles. When we set l0 equal to 10 million, we give the system this bigger jolt. Once again, the model returns to its previous CheY equilibrium after a few minutes. Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with ten million initial attractant ligand particles. Finally, with l0 equal to 100 million, we see what we might expect: the steepest drop in phosphorylated CheY yet, but a system that is able to return to equilibrium after a few minutes. Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100 million initial attractant ligand particles. Our model, which is built on real reaction rate parameters, provides compelling evidence that the E. coli chemotaxis system is robust to changes in its environment across several orders of magnitude of attractant concentration. This robustness has been observed in real bacteria45, as well as replicated by other computational simulations6. Aren’t bacteria magnificent? Traveling up an attractant gradient. We have simulated E. coli adapting to a single sudden change in its environment, but life often depends on responding to continual change. 
Imagine a glucose cube in an aqueous solution. As the cube dissolves, a gradient will form, with a glucose concentration that decreases outward from the cube. How will the tumbling frequency of E. coli change if the bacterium finds itself in an environment of an attractant gradient? Will the tumbling frequency decrease continuously as well, or will we see more complicated behavior? And once the cell reaches a region of high attractant concentration, will its default tumbling frequency stabilize to the same steady state? We will modify our model by increasing the concentration of the attractant ligand exponentially and seeing how the concentration of phosphorylated CheY changes. This model will simulate a bacterium traveling up an attractant gradient toward an attractant. Moreover, we will examine how the concentration of phosphorylated CheY changes as we change the gradient’s “steepness”, or the rate at which attractant ligand is increasing. Visit the following tutorial if you’re interested in following our adjustments for yourself. Visit tutorial. Steady state tumbling frequency is robust. To model a ligand concentration [L] that is increasing exponentially, we will use the function [L] = l0 · e^(k · t), where e is Euler’s number (e ≈ 2.71828…), l0 is the initial ligand concentration, k is a constant dictating the rate of exponential growth, and t is time. The parameter k represents the steepness of the gradient, since the higher the value of k, the faster the growth in the ligand concentration [L]. For example, the following figure shows the concentration over time of phosphorylated CheY (orange) when l0 = 1000 and k = 0.1. The concentration of phosphorylated CheY, and therefore the tumbling frequency, still decreases sharply as the ligand concentration increases, but after all ligands become bound to receptors (the plateau in the blue curve), receptor methylation causes the concentration of phosphorylated CheY to return to its equilibrium. For these values of l0 and k, the outcome is similar to when we provided an instantaneous increase in ligand, although the cell takes longer to reach its minimum concentration of phosphorylated CheY because the attractant concentration is increasing gradually. Plots of relevant molecule concentrations in our model (in number of molecules in the cell) over time (in seconds) when the concentration of ligand grows exponentially with l0 = 1000 and k = 0.1. The concentration of bound ligand (shown in red) quickly hits saturation, which causes a minimum in phosphorylated CheY (orange), and therefore a low tumbling frequency. To respond, the cell increases the methylation of receptors, which boosts the concentration of phosphorylated CheY back to equilibrium. The following figure shows the results of multiple simulations in which we vary the growth parameter k and plot only the concentration of phosphorylated CheY over time. The larger the value of k, the faster the increase in receptor binding, and the steeper the drop in the concentration of phosphorylated CheY. Plots of the concentration of phosphorylated CheY over time (in seconds) for different growth rates k of ligand concentration. The larger the value of k, the steeper the initial drop in the concentration of phosphorylated CheY, and the faster that methylation returns the concentration of phosphorylated CheY to equilibrium. The same equilibrium is obtained regardless of the value of k. The above figure further illustrates the robustness of bacterial chemotaxis to the rate of growth in ligand concentration. 
Whether the growth of the attractant is slow or fast, methylation will always bring the cell back to the same equilibrium concentration of phosphorylated CheY and therefore the same background tumbling frequency. From changing tumbling frequencies to an exploration algorithm. We hope that our work here has conveyed the elegance of bacterial chemotaxis, as well as the power of rule-based modeling and the Gillespie algorithm for simulating a complex biochemical system that may include a huge number of reactions. And yet we are missing an important part of the story. E. coli has evolved to ensure that if it detects a relative increase in concentration (i.e., an attractant gradient), then it can reduce its tumbling frequency in response. But we have not explored why changing its tumbling frequency would help a bacterium find food in the first place. After all, according to the run and tumble model, the direction that a bacterium is moving at any point in time is random! This quandary does not have an obvious intuitive answer. In this module’s conclusion, we will build a model to explain why E. coli’s randomized run and tumble walk algorithm is a clever way of locating resources in an unfamiliar land. Next lesson. Additional resources. If you find chemotaxis biology as interesting as we do, then we suggest the following resources. An amazing introduction to chemotaxis from the Parkinson Lab. A good overview of chemotaxis: Webre et al. 2003. A review article on the chemotaxis pathway and MCPs: Baker et al. 2005. A more recent review article of MCPs: Parkinson et al. 2015. Lecture notes on robustness and integral feedback: Berg 2008. Amin DN, Hazelbauer GL. 2010. Chemoreceptors in signaling complexes: shifted conformation and asymmetric coupling. Available online. Terwilliger TC, Wang JY, Koshland DE. 1986. Kinetics of receptor modification: the multiply methylated aspartate receptors involved in bacterial chemotaxis. The Journal of Biological Chemistry. Available online. Spiro PA, Parkinson JS, Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. PNAS 94:7263-7268. Available online. Shimizu TS, Delalez N, Pichler K, Berg HC. 2005. Monitoring bacterial chemotaxis by using bioluminescence resonance energy transfer: absence of feedback from the flagellar motors. PNAS. Available online. Krembel A, Colin R, Sourjik V. 2015. Importance of multiple methylation sites in Escherichia coli chemotaxis. Available online. Bray D, Bourret RB, Simon MI. 1993. Computer simulation of phosphorylation cascade controlling bacterial chemotaxis. Molecular Biology of the Cell. Available online."
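As a concrete footnote to the combinatorial-explosion discussion above, the counting argument can be checked in a few lines of Python (the state encoding is our own illustration; BioNetGen's actual rule syntax differs):

```python
from itertools import product

# Enumerate the MCP states described in the lesson: ligand-bound? x
# CheR-bound? x phosphorylated? x methylation level (low/medium/high).
states = list(product((False, True),               # bound to ligand
                      (False, True),               # bound to CheR
                      (False, True),               # phosphorylated
                      ('low', 'medium', 'high')))  # methylation level
print(len(states))   # 2 * 2 * 2 * 3 = 24 states

# The single rule "a free receptor T binds ligand L" expands into one
# explicit reaction per ligand-free receptor state.
unbound = [s for s in states if s[0] is False]
print(len(unbound))  # 12 reactions implied by one rule
```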
     
   } ,
  
   {
     
        "title"    : "Anisotropic Network Models",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/anm",
        "date"     : "",
        "content"  : "ANMs account for the direction of protein fluctuationsThe generalization of a Gaussian network model, in which we attempt to determine the directions of the forces influencing alpha carbons, is called an anisotropic network model (ANM). Although ANMs include directionality, they typically perform worse than GNMs when benchmarked against experimental data1. However, ANMs can be used to create animations depicting the range of motions and fluctuations of the protein, as well as to estimate the specific directions of movement caused by each of a protein’s modes.We will not delve into the mathematical intricacies of ANM calculations, but we will use ANMs to create animations visualizing protein fluctuations. For example, click on the animation below to see a video of estimated hemoglobin fluctuations produced from ANM. We can see that the left and right side of the protein are more flexible than the rest of the protein and twist in opposite directions.Collective motions of the slowest mode in human hemoglobin from ANM calculations using DynOmics.After we produce an animation like the one in the figure above, we also should attempt to explain it biologically. Human hemoglobin exists in two states: the tense state (T), in which it is not bound to an oxygen molecule, and the relaxed state (R), in which it is oxygenated. Hemoglobin’s mobility shown in the above animation corresponds to its ability to transition between these two states, in which salt-bridges and contacts can shift by up to seven angstroms2. This significant molecular flexibility exemplifies why we need to study protein dynamics as well as structure.We will now apply ANM to the SARS-CoV and SARS-CoV-2 spike proteins. We will also use NMWiz, which is short for “normal mode wizard”, to perform ANM calculations and create an animation of the SARS-CoV-2 (chimeric) RBD and the SARS-CoV RBD.Visit tutorialNote: Although we have separated our discussion of GNM and ANM, the DynOmics 1.0 server is an effort to integrate these approaches on a user-friendly platform. If you are interested, an additional tutorial shows how to use DynOmics to replicate some of the analysis below.ANM analysis of the coronavirus binding domainThe following two animations show the complex of each virus’s RBD (purple) bound with ACE2 (green). The following tables indicate the color-coding of each animation.SARS-CoV spike protein RBD (PDB: 2ajf)            SARS-CoV RBD      Purple                  Resid 463 to 472 (Loop)      Yellow              Resid 442 (Hotspot 31)      Orange              Resid 487 (Hotspot 353)      Red                  ACE2      Green              Resid 79, 82, 83 (Loop)      Silver              Resid 31, 35 (Hotspot 31)      Orange              Resid 38, 353 (Hotspot 353)      Red        SARS-CoV-2 spike protein chimeric RBD (PDB: 6vw1)            SARS-CoV-2 (Chimeric) RBD      Purple                  Resid 476 to 486 (Loop)      Yellow              Resid 455 (Hotspot 31)      Blue              Resid 493 (Hotspot 31)      Orange              Resid 501 (Hotspot 353)      Red                  ACE2      Green              Resid 79, 82, 83 (Loop)      Silver              Resid 31, 35 (Hotspot 31)      Orange              Resid 38, 353 (Hotspot 353)      Red        Recall from the previous lesson that the greatest contribution of negative energy to the RBD/ACE2 complex in SARS-CoV-2 was the region called “hotspot 31”, which is highlighted in blue and orange in the above figures. 
If you look closely, as the protein swings in to bind with ACE2, the blue and orange regions appear to line up just a bit more naturally in the SARS-CoV-2 animation than in the SARS-CoV animation. That is, the improved binding that we hypothesized for a static structure appears to be confirmed by dynamics simulations. This animation provides yet one more piece of evidence that SARS-CoV-2 is more effective than SARS-CoV at binding to the ACE2 enzyme. Next lesson. Yang L, Song G, Jernigan R. 2009. Protein elastic network models and the ranges of cooperativity. PNAS 106(30), 12347-12352. https://doi.org/10.1073/pnas.0902159106. Davis M, Tobi D. 2014. Multiple Gaussian network modes alignments reveals dynamically variable regions: The hemoglobin case. Proteins: Structure, Function, and Bioinformatics, 82(9), 2097-2105. https://doi.org/10.1002/prot.24565."
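For readers who would like to experiment beyond the web servers, the ProDy library that underlies NMWiz exposes ANM directly. A minimal sketch, assuming ProDy is installed and the PDB download succeeds (we select all alpha carbons for simplicity rather than specific chains):

```python
from prody import parsePDB, ANM

# Fetch the SARS-CoV-2 chimeric RBD bound to ACE2 and keep alpha carbons.
structure = parsePDB('6vw1')
calphas = structure.select('calpha')

anm = ANM('6vw1')            # anisotropic network model
anm.buildHessian(calphas)    # 3N x 3N Hessian from alpha carbon contacts
anm.calcModes(n_modes=3)     # the slowest modes dominate collective motion

print(anm.getEigvals())      # low eigenvalues correspond to soft, global modes
```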
     
   } ,
  
   {
     
        "title"    : "Gene Autoregulation is Surprisingly Frequent",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/autoregulation",
        "date"     : "",
        "content"  : "Using randomness to determine statistical significanceIn the previous lesson, we introduced the transcription factor network, in which a protein X is connected to a protein Y if X is a transcription factor that regulates the production of Y. We also saw that in the E. coli transcription factor network, there seemed to be a large number of loops, or edges connecting some transcription factor X to itself, and which indicate the autoregulation of X.In the introduction, we briefly referenced the notion of a network motif, a structure occurring often throughout a network. Our assertion is that the loop is a motif in the transcription factor network; how can we defend this claim?To argue that a loop is indeed a motif in the E. coli transcription factor network, we will apply a paradigm that occurs throughout computational biology (and science in general) when determining whether an observation is statistically significant. We will compare our observation against a  randomly generated dataset. Without getting into the statistical details, if an observation is frequent in a real dataset, and rare in a random dataset, then it is likely to be statistically significant. Randomness saves the day again!STOP: How can we apply this paradigm of a randomly generated dataset to determine whether a transcription factor network contains a significant number of loops?Comparing a real transcription factor network against a random networkTo determine whether the number of loops in the transcription factor network of E. coli is statistically significant, we will compare this number against the expected number of loops we would find in a randomly generated transcription factor network. If the former is much higher than the latter, then we have strong evidence that some selective force is causing gene autoregulation.To generate a random network, we will use an approach developed by Edgar Gilbert in 19591. Given an integer n and a probability p (between 0 and 1), we form n nodes. Then, for every possible pair of nodes X and Y, we connect X to Y via a directed edge with probability p; that is, we simulate the process of flipping a weighted coin that has probability p of coming up “heads”.Note: Simulating a weighted coin flip amounts to generating a “random” number x between 0 and 1, and considering it “heads” if x is less than p and “tails” otherwise. For more details on random number generation, consult Programming for Lovers).STOP: What should n and p be if we are generating a random network to compare against the E. coli transcription factor network?The full E. coli transcription factor network contains thousands of genes, most of which are not transcription factors. As a result, the approach described above may form a random network that connects non-transcription factors to other nodes, which we should avoid.Instead, we will focus on the network comprising only those E. coli transcription factors that regulate each other. This network has 197 nodes and 477 edges, and so we will begin by forming a random network with n = 197 nodes.We then select p to ensure that our random network will on average have 477 edges. To do so, we note that there are n2 pairs of nodes that could have an edge connecting them (n choices for the starting node and n for the ending node). If we were to set p equal to 1/n2, then we would expect on average only to see a single edge in the random network. 
We therefore scale this value by 477 and set p equal to 477/n^2 ≈ 0.0123, so that we will see, on average, 477 edges in our random network. In the following tutorial, we write some code to count the number of loops in the real E. coli transcription factor network. We then build a random network and compare the number of loops found in this network against the number of loops in the real network. Visit tutorial. The negative autoregulation motif. In a random network containing n nodes, the probability that a given edge is a loop is 1/n. Therefore, if the network has e edges, then we would on average see e/n loops in the network. In our case, n is 197 and e is 477; therefore, on average, we expect to see 477/197 ≈ 2.42 loops in a random network. Yet the real network of E. coli transcription factors that regulate each other contains 130 loops! Furthermore, in a random network, we would expect about half of the edges to correspond to activation and the other half to repression. But if you followed the preceding tutorial, then you know that of the 130 loops in the E. coli network, 35 correspond to activation and 95 correspond to repression. Just as you would be surprised to flip a coin 130 times and see “heads” 95 times, this skew toward repression suggests that the cell must be negatively autoregulating for some reason. Not only is autoregulation an important feature of transcription factors, but these transcription factors tend to negatively autoregulate. Why in the world would organisms have evolved the process of autoregulation only for a transcription factor to slow down its own transcription? In the next lesson, we will begin to unravel the mystery. Next lesson. Gilbert EN (1959). “Random Graphs”. Annals of Mathematical Statistics 30(4): 1141–1144. doi:10.1214/aoms/1177706098."
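A minimal sketch of this random-network procedure (the helper name is our own; the tutorial's code may differ):

```python
import random

def random_network(n, p):
    """Gilbert G(n, p) directed network: every ordered pair of nodes
    (including self-pairs) receives an edge with probability p."""
    return [(x, y) for x in range(n) for y in range(n)
            if random.random() < p]

n, target_edges = 197, 477
p = target_edges / n ** 2        # ~0.0123, so ~477 edges on average
edges = random_network(n, p)
loops = [e for e in edges if e[0] == e[1]]
# Expect roughly 477 edges and 477/197 ~ 2.42 loops per random network,
# far fewer than the 130 loops observed in the real E. coli network.
print(len(edges), len(loops))
```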
     
   } ,
  
   {
     
        "title"    : "Protein Biochemistry",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/biochemistry",
        "date"     : "",
        "content"  : "The four levels of protein structureProtein structure is a broad term that encapsulates four different levels of description. A protein’s primary structure refers to the amino acid sequence of the polypeptide chain. The primary structure of human hemoglobin subunit alpha can be downloaded here, and the primary structure of the SARS-CoV-2 spike protein, which we showed earlier, can be downloaded here.A protein’s secondary structure describes its highly regular, repeating intermediate substructures that form before the overall protein folding process completes. The two most common such substructures, shown in the figure below, are alpha helices and beta sheets. Alpha helices occur when nearby amino acids wrap around to form a tube structure; beta sheets occur when nearby amino acids line up side-by-side.The general shape of alpha helices (left) and beta sheets (right), the two most common protein secondary structures. Source: Cornell, B. (n.d.). https://ib.bioninja.com.au/higher-level/topic-7-nucleic-acids/73-translation/protein-structure.htmlA protein’s tertiary structure describes its final 3D shape after the polypeptide chain has folded and is chemically stable. Throughout this module, when discussing the “shape” or “structure” of a protein, we are almost exclusively referring to its tertiary structure. The figure below shows the tertiary structure of human hemoglobin subunit alpha. For the sake of simplicity, this figure does not show the position of every atom in the protein but rather represents the protein shape as a composition of secondary structures.The tertiary structure of human hemoglobin subunit alpha. Within the structure are multiple alpha helix secondary structures. Source: https://www.rcsb.org/structure/1SI4.Finally, some proteins have a quaternary structure, which describes the protein’s interaction with other copies of itself to form a single functional unit, or a multimer. Many proteins do not have a quaternary structure and function as an independent monomer. The figure below shows the quaternary structure of hemoglobin, which is a multimer consisting of two alpha subunits and two beta subunits.The quaternary structure of human hemoglobin, which consists of two alpha subunits (shown in red) and two beta subunits (shown in blue). Source: https://commons.wikimedia.org/wiki/File:1GZX_Haemoglobin.png.As for coronaviruses, the spike protein is a homotrimer, meaning that it is formed of three essentially identical units called chains, each one translated from the corresponding region of the coronavirus’s genome; these chains are colored differently in the figure below. In this module, when discussing the structure of the spike protein, we often are referring to the structure of a single chain.A side and top view of the quaternary structure of the SARS-CoV-2 spike protein homotrimer, with its three chains highlighted using different colors.The structural units making up proteins are often hierarchical, and the spike protein is no exception. Each spike protein chain is a dimer, consisting of two subunits called S1 and S2. Each of these subunits further divides into protein domains, distinct structural units within the protein that fold independently and are typically responsible for a specific interaction or function. 
For example, the SARS-CoV-2 spike protein has a receptor binding domain (RBD) located on the S1 subunit of the spike protein that is responsible for interacting with the human ACE2 enzyme; the rest of the protein does not come into contact with ACE2. We will say more about the RBD soon. Proteins seek the lowest energy conformation. Now that we know a bit more about how protein structure is defined, we will discuss why proteins fold in the same way every time. In other words, what are the factors driving the magic algorithm? Amino acids’ side chain variety causes them to have different chemical properties, which can lead to some protein conformations being more chemically “preferable” than others. For example, the table below groups the twenty amino acids commonly occurring in proteins according to their chemical properties. Nine of these amino acids are hydrophobic (also called nonpolar), meaning that their side chains are repelled by water, and so we tend to find these amino acids tucked away on the interior of the protein. A chart of the twenty amino acids grouped by chemical properties. The side chain of each amino acid is highlighted in blue. Image courtesy: OpenStax Biology. We can therefore view protein folding as finding the tertiary structure that is the most stable given a polypeptide’s primary structure. A central theme of the previous module on bacterial chemotaxis was that a system of chemical reactions moves toward equilibrium. The same principle is true of protein folding; when a protein folds into its final structure, it reaches a conformation that is as chemically stable as possible. To be more precise, the potential energy (sometimes called free energy) of a molecule is the energy stored within an object due to its position, state, and arrangement. In molecular mechanics, the potential energy is the sum of bonded energy and non-bonded energy. As the protein bends and twists into a stable conformation, bonded energy derives from the protein’s covalent bonds, as well as the bond angles between adjacent amino acids and the torsion angles that we introduced in the previous lesson. Non-bonded energy comprises electrostatic interactions and van der Waals interactions. Electrostatic interactions refer to the attraction and repulsion forces arising from the electric charges of pairs of charged amino acids. Two of the twenty standard amino acids (arginine and lysine) are positively charged, and two (aspartic acid and glutamic acid) are negatively charged. Two nearby amino acids of opposite charge may interact to form a salt bridge. Conformations that contain salt bridges and keep apart similarly charged amino acids will therefore have a lower free energy component contributed by electrostatic interactions. As for van der Waals interactions, atoms are dynamic systems, with electrons constantly buzzing around the nucleus, as shown in the figure below. A carbon-12 atom showing six positively charged protons (green), six neutrally charged neutrons (blue), and six negatively charged electrons (red). Under typical circumstances, the electrons will most likely be distributed uniformly around the nucleus. However, due to random chance, the negatively charged electrons in an atom could momentarily be unevenly distributed on one side of the nucleus. This uneven distribution will cause the atom to have a temporary negative charge on the side with the excess electrons and a temporary positive charge on the opposite side. 
As a result of this charge, one side of the atom may attract only the oppositely charged components of another atom, creating an induced dipole in that atom in turn, as shown in the figure below. Van der Waals forces refer to the attraction and repulsion between atoms because of induced dipoles. Due to random chance, the electrons in the atom on the left have clustered on the left side of the atom, creating a net negative charge on this side of the atom and therefore a net positive charge on the right side of the atom. This polarity induces a dipole in the atom on the right, whose electrons are attracted because of van der Waals forces. As the protein folds, it seeks a conformation of lowest total potential energy based on the combination of all the above-mentioned forces. For an analogy, imagine a ball on a slope, as shown in the following figure. The ball will move down the slope unless it is pushed uphill by some outside force, making it unlikely that the ball will wind up at the top of a hill. We will keep this analogy in mind as we return to the problem of protein structure prediction. A ball on a hill offers an analogy for a protein folding into the lowest energy structure. As the ball is more likely to move down into a valley, a protein is more likely to fold into a more stable, lower energy conformation. Next lesson"
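The non-bonded terms described above are commonly approximated by simple pairwise formulas. The toy sketch below is illustrative only, with made-up parameter values rather than those of any real force field:

```python
def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """Toy van der Waals (Lennard-Jones) term: weak attraction at moderate
    separations, steep repulsion when atoms overlap. Units illustrative."""
    return 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

def coulomb(r, q1, q2, k=332.06):
    """Toy electrostatic term between point charges q1, q2 (in elementary
    charges) separated by r angstroms; k gives energies in kcal/mol."""
    return k * q1 * q2 / r

# A salt bridge (opposite unit charges) contributes negative, stabilizing
# energy; like charges would contribute positive, destabilizing energy.
r = 4.0
print(lennard_jones(r))    # small van der Waals contribution
print(coulomb(r, +1, -1))  # about -83 kcal/mol (ignoring solvent screening)
```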
     
   } ,
  
   {
     
        "title"    : "A Biochemically Accurate Model of Bacterial Chemotaxis",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/biochemistry",
        "date"     : "",
        "content"  : "Transducing an extracellular signal to a cell’s interiorWe now turn to the question of how the cell conveys the extracellular signal it has detected via the process of signal transduction to the cell’s interior. In other words, when E. coli senses an increase in the concentration of glucose, meaning that more ligand-receptor binding is taking place at the receptor that recognizes glucose, how does the bacterium change its behavior?The engine of signal transduction is phosphorylation, a chemical reaction that attaches a phosphoryl group (PO3-) to an organic molecule.  Phosphoryl modifications serve as an information exchange of sorts because, as we will see, they activate or deactivate certain enzymes.A phosphoryl group usually comes from one of two sources. First, the phosphoryl can be broken off from an adenosine triphosphate (ATP) molecule, the “energy currency” of the cell, producing adenosine diphosphate (ADP). Second, the phosphoryl can be exchanged from a phosphorylated molecule that loses its phosphoryl group in a dephosphorylation reaction.For many cellular responses, including bacterial chemotaxis, a sequence of phosphorylation and dephosphorylation events called a phosphorylation cascade serves to transmit information within the cell about the amount of ligand binding being detected on the cell’s exterior. In this lesson, we discuss how this cascade of chemical reactions leads to a change in bacterial movement.A high-level view of the transduction pathway for chemotaxis is shown in the figure below. The cell membrane receptors that we have been working with are called methyl-accepting chemotaxis proteins (MCPs), and they bridge the cellular membrane, binding both to ligand stimuli in the cell exterior and to other proteins on the inside of the cell. The pathway also includes a number of additional proteins, which all start with the prefix “Che” (short for “chemotaxis”).A summary of the chemotaxis transduction pathway. A ligand binding signal is propagated through CheA and CheY phosphorylation, which leads to a response of clockwise flagellar rotation. The blue curved arrow denotes phosphorylation, the grey curved arrow denotes dephosphorylation, and the blue dashed arrow denotes a chemical interaction. Our figure is a simplified view of Parkinson Lab illustrations.On the interior of the cellular membrane, MCPs form complexes with two proteins called CheW and CheA. In the absence of MCP-ligand binding, this complex is more stable, and the CheA molecule autophosphorylates, meaning that it adds a phosphoryl group taken from ATP to itself  a concept that might seem mystical if you had not already followed our discussion of autoregulation in the previous module.A phosphorylated CheA protein can pass on its phosphoryl group to a molecule called CheY, which interacts with the flagellum in the following way. Each flagellum has a protein complex called the flagellar motor switch that is responsible for controlling the direction of flagellar rotation. The interaction of this protein complex with phosphorylated CheY induces a change of flagellar rotation from counter-clockwise to clockwise. As we discussed earlier in the module, this change in flagellar rotation causes the bacterium to tumble, which in the absence of an increase in attractant occurs every 1 to 1.5 seconds.Yet when a ligand binds to the MCP, the MCP undergoes conformation changes, which reduce the stability of the complex with CheW and CheA. 
As a result, CheA is less readily able to autophosphorylate, which means that it does not phosphorylate CheY, which cannot change the flagellar rotation to clockwise, and so the bacterium is less likely to tumble. In summary, attractant ligand binding results in less phosphorylated CheA and CheY, which means fewer flagellar interactions and therefore less tumbling, which in turn causes the bacterium to run for a longer period of time. Note: A critical part of this process is that if a cell with a high concentration of phosphorylated CheY detects an attractant ligand, then it needs to decrease this concentration quickly; otherwise, the cell will not be able to change its tumbling frequency. To this end, the cell is able to dephosphorylate CheY using an enzyme called CheZ. Adding phosphorylation events to our model of chemotaxis. We would like to use the Gillespie algorithm that we introduced in the previous lesson to simulate the reactions driving chemotaxis signal transduction and see what happens if the bacterium “senses an attractant”, meaning that the attractant ligand’s concentration increases and leads to more receptor-ligand binding. This model will be more complicated than any we have introduced thus far. We will need to account for both bound and unbound MCP molecules, as well as phosphorylated and unphosphorylated CheA and CheY enzymes. We will also need to model CheA phosphorylation reactions, which depend on the current concentrations of bound and unbound MCP molecules. We will make at least one simplifying assumption: the MCP receptor is permanently bound to CheA and CheW, so that we do not need to represent these molecules individually. In other words, rather than thinking about CheA autophosphorylating, we will think about the receptor that includes CheA autophosphorylating. We will need six reactions. Two reversible reactions represent ligand-receptor binding: one for phosphorylated receptors, and another for unphosphorylated receptors. Two reactions represent MCP phosphorylation and take place at different rates based on whether the MCP is bound to a ligand (in our model, the phosphorylation rate is five times greater when the MCP is unbound). One reaction represents phosphorylation of CheY, and another reaction models dephosphorylation, which is mediated by the CheZ enzyme. Once we have built this model, we would like to see what happens when we change the concentrations of the ligand. Ideally, the bacterium should be able to distinguish different ligand concentrations. That is, the higher the concentration of an attractant ligand, the lower the concentration of phosphorylated CheY, and the lower the tumbling frequency of the bacterium. But does higher attractant concentration in our model really lead to a lower concentration of phosphorylated CheY? Let’s find out by incorporating the phosphorylation pathway into our ligand-receptor model in the following tutorial. Visit tutorial. Changing ligand concentrations leads to a change in internal molecular concentrations. The top panel of the following figure shows the concentrations of phosphorylated CheA and CheY in a system at equilibrium in the absence of ligand. As we might expect, these concentrations remain at steady state (with some healthy noise), and so the cell stays at its background tumbling frequency. The addition of 5,000 attractant ligand molecules increases the concentration of bound receptors, therefore leading to less CheA autophosphorylation and less phosphorylated CheY (middle panel). 
If we instead have 100,000 initial attractant molecules, then we see an even more drastic decrease in phosphorylated CheA and CheY (bottom panel). Molecular concentrations over time (in seconds) in a chemotaxis simulation for three different initial unbound attractant ligand concentrations: no attractant ligand (top), 5,000 ligand particles (middle), and 100,000 ligand particles (bottom). Note that the simulated cell’s bound ligand concentration (green) achieves equilibrium very quickly in each case. This model, powered by the Gillespie algorithm, confirms the biological observation that an increase in attractant reduces the concentration of phosphorylated CheY. The reduction takes place remarkably quickly, with the cell attaining a new equilibrium in a fraction of a second. The biochemistry powering chemotaxis may be elegant, but it is also simple, and so perhaps it is not surprising that the model’s particle concentrations reproduced the response of E. coli to an attractant ligand. But what we have shown in this lesson is just part of the story. In the next lesson, we will see that the biochemical realities of chemotaxis are more complicated, and for good reason: this added complexity will allow E. coli, and our model of it, to react to a dynamic world with surprising sophistication. Next lesson"
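A minimal sketch of how the Gillespie algorithm advances such a model (species counts and rate constants are illustrative placeholders, not the tutorial's values, and only a simplified slice of the pathway is included):

```python
import random

# Toy species: free ligand L, unbound receptor T, bound receptor LT,
# unphosphorylated CheY, phosphorylated CheY (Yp).
state = {'L': 500, 'T': 700, 'LT': 0, 'Y': 1000, 'Yp': 0}
reactions = [
    (lambda s: 1e-3 * s['L'] * s['T'], {'L': -1, 'T': -1, 'LT': +1}),  # binding
    (lambda s: 5e-2 * s['T'] * s['Y'], {'Y': -1, 'Yp': +1}),   # unbound receptor phosphorylates CheY
    (lambda s: 1e-2 * s['LT'] * s['Y'], {'Y': -1, 'Yp': +1}),  # bound receptor: five times slower
    (lambda s: 1e-1 * s['Yp'], {'Yp': -1, 'Y': +1}),           # CheZ-mediated dephosphorylation
]

t, t_end = 0.0, 0.1
while t < t_end:
    propensities = [rate(state) for rate, _ in reactions]
    total = sum(propensities)
    if total == 0:
        break
    t += random.expovariate(total)    # exponential waiting time to next event
    pick = random.uniform(0, total)   # choose one reaction by propensity
    for propensity, (_, change) in zip(propensities, reactions):
        if pick < propensity:
            for species, delta in change.items():
                state[species] += delta
            break
        pick -= propensity

print(t, state)
```

Each iteration draws an exponential waiting time from the total propensity and then fires one reaction chosen in proportion to its propensity, which is the essence of the algorithm.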
     
   } ,
  
   {
     
        "title"    : "Biological Modeling: A Short Tour",
        "category" : "",
        "tags"     : "",
        "url"      : "/buy_the_book/",
        "date"     : "",
        "content"  : "Buy the bookWe are happy to let you know that the text companion to this course, Biological Modeling: A Short Tour, is available in both print and electronic formats!  Print book: Get it from Amazon.  E-book: Get it from Leanpub.About the bookBiological Modeling: A Short Tour offers readers a deep but concise exploration of topics in modeling biological systems at multiple scales. Each chapter poses a single biological question, from why zebras have stripes, to how bacteria explore their world intelligently, to why the SARS-CoV-2 spike protein was so effective at binding to human cells. The book then introduces the modeling concepts needed to answer this question. Some talented students at Carnegie Mellon University who helped build the Biological Modeling project appear in the book as chapter co-authors.Biological Modeling: A Short Tour follows the core content of the course, whereas the tutorials powering this core content are hosted on this website.Crowdfunding the publication of our bookPublication of Biological Modeling: A Short Tour was graciously funded by several hundreds backers via Kickstarter and Indiegogo. We are eternally grateful to these supporters who brought the book to life."
     
   } ,
  
   {
     
        "title"    : "An Overview of Classification and k-Nearest Neighbors",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/classification",
        "date"     : "",
        "content"  : "The classification problem and the iris flower datasetCategorizing images of WBCs according to family is a specific instance of a ubiquitous problem in data science, in which we wish to classify each object in a given dataset into one of k groups called classes. In our ongoing example, the data are images of WBC nuclei, and the classes are the three main families of WBCs (granulocytes, lymphocytes, and monocytes). To take a different example, our data could be tumor genomes sequenced from cancer patients, which we want to classify according to the therapeutic that should be prescribed for the patient. Or the data may be the past sales behavior of shoppers, whom we wish to classify into two classes based on a prediction of whether they will buy a new product.A classical dataset commonly used for motivating classification is the iris flower dataset, which was compiled by Edgar Anderson12 and used by Ronald Fisher in a seminal paper on classification in 19363. Anderson took morphological measurements from 150 iris flowers, evenly divided over three species (see figure below).                                                                                                        Representative images of the three species of iris included in Anderson’s iris flower dataset. Image courtesies, from left to right: Emma Forsberg, unknown author, Robert H. Mohlenbrock.  Anderson measured four attributes, or features, of each flower in his dataset: the width and height of the flower’s petal, and the width and height of the flower’s sepal (a green offshoot beneath the petals). The features and species labels for twelve flowers in the iris flower dataset are shown in the table below (click here for the full dataset). Fisher noticed that flowers from the same species had similar features and wondered whether it was possible to classify the flowers according to its species using only Anderson’s four features.            Sepal length (cm)      Sepal width (cm)      Petal length (cm)      Petal width (cm)      Species                  5.1      3.5      1.4      0.2      I. setosa              4.9      3.0      1.4      0.2      I. setosa              4.7      3.2      1.3      0.2      I. setosa              4.6      3.1      1.5      0.2      I. setosa              7.0      3.2      4.7      1.4      I. versicolor              6.4      3.2      4.5      1.5      I. versicolor              6.9      3.1      4.9      1.5      I. versicolor              5.5      2.3      4.0      1.3      I. versicolor              6.3      3.3      6.0      2.5      I. virginica              5.8      2.7      5.1      1.9      I. virginica              7.1      3.0      5.9      2.1      I. virginica              6.3      2.9      5.6      1.8      I. virginica      A table containing values of the four features for twelve members of the iris flower dataset. The complete dataset was accessed from the University of California, Irvine Machine Learning Repository].STOP: What are typical feature values for flowers from each species in the table above? If presented with an iris of unknown species, how could you use these features to classify it?From flowers to vectorsIf we were to use only two of the four features in the iris flower dataset, then a flower’s feature values x and y could be represented as a point in two-dimensional space (x, y). 
The figure below shows such a plot for the features of petal length (x-axis) and petal width (y-axis). Petal length (x-axis) plotted against petal width (y-axis) for each of the flowers in the iris flower dataset, with data points colored by species. Although there were fifty flowers from each species, there are not fifty points corresponding to every species because some flowers have the same petal length and width and therefore occupy the same point. Note how stark the pattern in the above figure is. Even though we chose only two features from the iris flowers, the points associated with the flowers mostly divide into three main clusters by species. If we were to use all four features for the iris dataset, then every flower would be represented by a point in four-dimensional space. For example, the first flower in our initial table of iris features would be represented by the point (5.1, 3.5, 1.4, 0.2). In general, when classifying a collection of data with n features, each element in the dataset can be represented by a feature vector of length n, whose i-th value corresponds to the value of the data point’s i-th feature. Classifying unknown elements with k-nearest neighbors. Our hope is that for datasets other than the iris flower dataset, elements from the same class will have feature vectors that are nearby in n-dimensional space. If so, then we can classify a data point whose class is unknown by determining which data points with known classification it is near. STOP: Consider the gray point with unknown class in the figure below. Should it be assigned to the class of the green points or to the class of the blue points? An unknown point (gray) along with a collection of nearby points belonging to two classes, colored green and blue. The preceding question indicates that classifying points can be surprisingly open-ended. Because of this freedom, researchers have devised a variety of different approaches for classifying data given data with known classes. We will discuss a simple but powerful classification algorithm called k-nearest neighbors, or k-NN4. In k-NN, we fix a positive integer k in advance. Then, for each point with unknown class, we assign it to the class possessed by the largest number of its k closest neighbors. In the ongoing example, if we were using k equal to 1, then we would assign the unknown point to the green class (see figure below). When k is equal to 1, k-NN classifies an unknown point according to the point of known class that is nearest; for this reason, the gray point above with unknown class would be assigned to the green class. However, with the same data and k equal to 4, the figure below shows that a majority of the k nearest neighbors are blue, and so we classify the unknown point as blue. This example reinforces a theme of this course, that the results of an algorithm can be sensitive to our choice of parameters. When using k-NN with k equal to 4, k-NN classifies the unknown point as blue, since three of its four closest neighbors are blue. STOP: When k is equal to 2 or 6 for the ongoing example, we obtain a tie in the number of points from each known class belonging to the k nearest neighbors of a point with unknown class. How could we break ties in k-NN? In the more general case, in which feature vectors have n coordinates, we can determine which points are nearest to a given point by using the Euclidean distance, which generalizes the distance formula between vectors in three-dimensional space to the case of n-dimensional vectors. 
The Euclidean distance between vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is given by the square root of the sum of squared differences between corresponding vector elements:\[d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n-y_n)^2}\,.\]We now have learned how to use k-NN to classify feature vectors with unknown classes given vectors with known classes. There is just one problem: how can we convert an image of a WBC nucleus into a vector? Next lesson. Anderson E (1935) The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5. Anderson E (1936) The species problem in Iris. Annals of the Missouri Botanical Garden 23(3): 457-509. Available online. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(2): 179-188. Available online. Fix E and Hodges JL (1951) Discriminatory analysis, nonparametric discrimination: consistency properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field. Available online."
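A minimal sketch of k-NN with this distance, using four rows from the iris table above (the helper names are our own):

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(unknown, data, k):
    """Assign `unknown` the majority class among its k nearest neighbors.
    `data` is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(data, key=lambda pair: euclidean(unknown, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Four features per flower: sepal length/width, petal length/width (cm).
data = [((5.1, 3.5, 1.4, 0.2), 'I. setosa'),
        ((4.9, 3.0, 1.4, 0.2), 'I. setosa'),
        ((7.0, 3.2, 4.7, 1.4), 'I. versicolor'),
        ((6.3, 3.3, 6.0, 2.5), 'I. virginica')]
print(knn_classify((5.0, 3.4, 1.5, 0.2), data, k=3))  # 'I. setosa'
```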
     
   } ,
  
   {
     
        "title"    : "Conclusion: The Robustness of Biological Oscillators",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/conclusion",
        "date"     : "",
        "content"  : "Biological oscillators must be robustIf your heart skips a beat when you are watching a horror movie, then it should return quickly to its natural rhythm. When you hold your breath to dive underwater, your normal breathing resumes at the surface. And regardless of what functions your cells perform, they should be able to maintain a normal cell cycle despite disturbance.An excellent illustration of oscillator robustness is the body’s ability to handle jet lag. There is no apparent reason why we would have evolved to fly halfway around the world in a few hours. And yet our circadian clock is so resilient that after a few days of fatigue and crankiness, it returns us to a normal daily cycle.In the previous lesson, we saw that the repressilator, a three-element motif, can exhibit oscillations even in a noisy environment of randomly moving particles. The repressilator’s resilience makes us wonder how well it can respond to a disturbance in the concentrations of its particles.A coarse-grained repressilator modelWith our work on Turing patterns in the prologue, tracking the movements of many individual particles led to a slow simulation that did not scale well given more particles or reactions. This observation led us to devise a coarse-grained reaction-diffusion model that was still able to produce Turing patterns. We used a cellular automaton because the concentrations of particles varied in different locations and were diffusing at different rates.We would like to devise a coarse-grained model of the repressilator. However, the particles diffuse at the same rate and are uniform across the simulation, and so there is no need to track concentrations in individual locations. As a result, we will use a simulation that assumes that particles are well-mixed.For example, say that we are modeling a degradation reaction. If we start with a concentration of 10,000 X particles, then after a single time step, we will simply multiply the number of X particles by (1-k), where k is a parameter related to the rate of the degradation reaction.As for a repression reaction like X + Y  X, we can update the concentration of Y particles by subtracting some factor r times the current concentrations of X and Y particles.We will further discuss the technical details required to implement a well-mixed reaction-diffusion model in the next module. In the meantime, we would like to see what happens when we make a major disturbance to the concentration of one of the particles in the well-mixed model. Do particle concentrations resume their oscillations? To build this model of the repressilator, check out the following tutorial.Visit tutorialThe repressilator is robust to disturbanceThe figure below shows plots over time of particle concentrations in our well-mixed simulation of the repressilator. (Note that these plots are less noisy than the ones that we produced previously because we are assuming a well-mixed environment.)  Midway through this simulation, we greatly increase the concentration of Y particles.A plot of particle concentrations in the well-mixed repressilator model over time. 
Adding a significant number of Y particles to our simulation (the second blue peak) produces little ultimate disturbance to the concentrations of the three particles, which return to normal oscillations within a single cycle.Because of the spike in the concentration of Y, the reaction Y + Z → Y suppresses the concentration of Z for longer than usual, and so the concentration of X is free to increase for longer than normal. As a result, the next peak of X particles is higher than normal.We might hypothesize that this process would continue, with a tall peak in the concentration of Z. However, the peak in the concentration of Z is no taller than normal, and the following peak shows a normal concentration of Y. The system has very quickly absorbed the blow of an increase in concentration of Y and returned to normal within one cycle.Even with a much larger jolt to the concentration of Y, the concentrations of the three particles return to normal oscillations very quickly, as shown below.A larger increase in the concentration of Y particles than in the previous figure does not produce a substantive change in the system.The repressilator is not the only network motif that leads to oscillations of particle concentrations, but robustness to disturbance is a shared feature of all these motifs. Furthermore, the repressilator is not the most robust oscillator that we can build. Researchers have shown that at least five components are typically needed to build a very robust oscillator,1 which may help explain why real oscillators tend to have more than three components.In the next module, we will encounter a much more involved biochemical process, with far more molecules and reactions, that is used by bacteria to cleverly (and robustly) explore their environment. In fact, we will have so many particles and so many reactions that we will need to completely rethink how we formulate our model.In the meantime, check out the exercises below to continue building your understanding of transcription factor networks and network motifs.Visit exercisesNext module            Castillo-Hair, S. M., Villota, E. R., &amp; Coronado, A. M. (2015). Design principles for robust oscillatory behavior. Systems and Synthetic Biology, 9(3), 125–133. https://doi.org/10.1007/s11693-015-9178-6 &#8617;      "
     
   } ,
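The update rules described in this entry translate directly into code. Below is a minimal deterministic sketch under assumed placeholder parameters (a production rate `beta`, degradation rate `k`, and repression factor `r` are inventions for illustration); the course's tutorial instead uses a stochastic well-mixed simulator, whose noise is part of what sustains the oscillations shown in the figures.

```python
# One deterministic time step of the well-mixed update rules described above.
def step(x, y, z, beta=500.0, k=0.05, r=1e-5):
    new_x = x * (1 - k) + beta - r * z * x  # degradation, production, repression Z + X -> Z
    new_y = y * (1 - k) + beta - r * x * y  # repression X + Y -> X removes Y
    new_z = z * (1 - k) + beta - r * y * z  # repression Y + Z -> Y removes Z
    return new_x, new_y, new_z

# Disturb Y midway through the simulation and watch whether the
# concentrations recover, as in the figures above:
x, y, z = 10000.0, 5000.0, 1000.0
for t in range(10000):
    if t == 5000:
        y += 20000.0  # the sudden spike of Y particles
    x, y, z = step(x, y, z)
```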
  
   {
     
        "title"    : "Conclusion: Toward Deep Learning",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/conclusion",
        "date"     : "",
        "content"  : "A brief introduction to artificial neuronsThe best known classification algorithms for WBC image analysis1 use a technique called deep learning. You have probably seen this term wielded with reverence, and so in this chapter’s conclusion, we will briefly explain what it means and how it can be applied to classification.Neurons are cells in the nervous system that are electrically charged and that use this charge as a method of communication to other cells. As you are reading this text, huge numbers of neurons are firing in your brain as it processes the visual information that it receives. The basic structure of a neuron is shown in the figure below.		The components of a neuron. Electrical signals are passed down axons and exchanged at terminal boundaries between neurons. Image courtesy: Jennifer Walinga.In 1943, Warren McCulloch (a neuroscientist) and Walter Pitts (a logician) devised an artificial model of a neuron that is now called a McCulloch-Pitts neuron.2 A McCulloch-Pitts neuron has a fixed threshold b and takes as input n binary variables x1, …, xn, where each of these variables is equal to either 0 or 1. The neuron outputs 1 if x1 +  + xn  b; otherwise, it returns 0. If a McCulloch-Pitts neuron outputs 1, then we say that it fires.The McCulloch-Pitts neuron with n equal to 2 and b equal to 2 is shown in the figure below. The only way that this neuron will fire is if both inputs x1 and x2 are equal to 1.		A McCullough-Pitts neuron with n equal to 2 and *b* equal to 2. The neuron fires when x1 + x2 is at least equal to *b*, which occurs precisely when both input variables are equal to 1; if either input variable is equal to 0, then x1 + x2 will be less than *b* and the neuron will not fire (i.e., it will output 0).Note: The mathematically astute reader may have noticed that the output of the McCulloch-Pitts neuron in the figure above is identical to the logical proposition x1 AND x2, which explains why these neurons started as a collaboration between a neuroscientist and a logician.In 1958, Frank Rosenblatt generalized the McCulloch-Pitts neuron into a perceptron.3 This artificial neuron also has a threshold constant b and n binary input variables xi, but it also includes a collection of real-valued constant weights wi that are applied to each input variable. That is, the neuron will output 1 (fire) when the weighted sum w1 · x1 + w2 · x2 +  + wn · xn is greater than or equal to b.Note: A McCulloch-Pitts neuron is a perceptron for which all of the wi are equal to 1.For example, consider the perceptron shown in the figure below; we assign the weight wi to the edge connecting input variable xi to the neuron.		A perceptron with two input variables. The perceptron includes a constant threshold and constant weights w1 and w2. The perceptron outputs 1 when the weighted sum w1 · x1 + w2 · x2 is greater than or equal to *b*, and it outputs 0 otherwise.The modern concept of an artificial neuron, as shown in the figure below, generalizes the perceptron further in two ways. First, the input variables xi can have arbitrary decimal values (often, these inputs are constrained to be between 0 and 1). Second, rather than the neuron rigidly outputting 1 when w1 · x1 + w2 · x2 +  + wn · wn is greater than or equal to b, we subtract b from the weighted sum and pass the resulting value into a function f called an activation function; that is, the neuron outputs f(w1 · x1 + w2 · x2 +  + wn · wn). In this form of the neuron, b is called the bias of the neuron.		
A general form of an artificial neuron for two input variables x1 and x2, two constant weights w1 and w2, a constant bias *b*, and an activation function f. The output of the neuron, rather than being strictly 0 or 1, is f(w1 · x1 + w2 · x2 - *b*).A commonly used activation function is the logistic function, f(x) = 1/(1 + e^(-x)), shown in the figure below. Note that the output of this function ranges between 0 (when its input is very negative) and 1 (when its input is very positive).		A plot of the logistic function f(x) = 1/(1 + e^(-x)), an increasing function whose values range between 0 and 1. (A minimal sketch of such a neuron in code follows this entry.)STOP: Because of its simplicity, researchers now often use a “rectifier” activation function: f(x) = max(0, x). What does the graph of this function look like? What is the activation function used by a perceptron, and how does it differ from the rectifier function?Framing a classification problem using neural networksThe outputs of mathematical functions can be used as inputs to other functions via function composition. For example, if f(x) = 2x-1, g(x) = e^x, and h(x) = cos(x), then h(g(f(x))) = cos(e^(2x-1)). Similarly, we can use artificial neurons as building blocks by linking them together, with the outputs of some neurons serving as inputs to other neurons. Linking neurons in this way produces a neural network such as the one shown in the figure below, which we will take time to explain.		An illustration of a potential neural network used for WBC image classification. This network assumes that each WBC is represented by *n* features, which serve as the input variables for the network. A number of hidden layers of additional neurons may be used, with connections between some of the neurons in adjacent layers. A final output layer of three neurons corresponds to each of the three WBC classes; our hope is that the weights of the neurons in the network are chosen so that the appropriate neuron outputs a value close to 1 corresponding to an image’s class, and that the other two neurons output values close to 0.We have discussed converting each object x in a dataset (such as our WBC image dataset) into a feature vector (x1, x2, …, xn) representing each of the n feature values of x. In the figure above, these n variables of the data’s feature vectors become the n input variables of the neural network.We typically then connect all the input variables to most or all of a collection of (possibly many) artificial neurons, called a hidden layer, which is shown for simplicity as a gray box in the figure above. If we have m artificial neurons in the hidden layer and n input variables, then we will have m bias constants as well as m · n weights, each one assigned to an edge connecting an input variable to a hidden layer neuron (all these edges are indicated by dashed edges in the figure above). Our model has quickly accumulated an enormous number of parameters!The first hidden layer of neurons may then be connected as inputs to neurons in another hidden layer, which are connected to neurons in another layer, and so on. As a result, practical neural networks may have several hidden layers with thousands of neurons, each with their own biases, input weights, and even different activation functions. The most common usage of the term deep learning refers to solving problems using neural networks having several hidden layers; the discussion of the many challenges in designing neural networks for deep learning would fill an entire course.The remaining question is what the output of our neural network should be. 
If we would like to apply the network to classify our data into k classes, then we typically will connect the final hidden layer of neurons to k output neurons. Ideally, if we know that a data point x belongs to the i-th class, then when we use the values of its feature vector as input to the network, we would like for the i-th output neuron to output a value close to 1, and for all other output neurons to output a value close to 0. For a neural network to correctly classify objects in our dataset, we must find such an ideal choice for the biases of each neuron and the weights assigned to input variables at each neuron, assuming that we have decided on which activation function(s) to use for the network’s neurons. We will now define quantitatively what makes a given choice of neural network parameters suitable for classification.STOP: Say that a neural network has 100 input variables, three output neurons, and four hidden layers of 1000 neurons each. Say also that every neuron in one layer is connected as an input to every neuron in the next layer. How many bias parameters will this network have? How many weight parameters will this network have?Defining the best choice of parameters for a neural networkIn a previous lesson, we discussed how to assess the results of a classification algorithm like k-NN on a collection of data with known classes. To generalize this idea to our neural network classifier, we divide our data that have known classes into a training set and a test set, where the former is typically much larger than the latter. We then seek the choice of parameters for the neural network that “performs the best” on the training set, which we will now explain.Each data point x in the training set has a ground truth classification vector c(x) = (c1, c2, …, ck), where if x belongs to class j, then cj is equal to 1, and the other elements of c(x) are equal to 0. The point x also has an output vector o(x) = (o1, o2, …, ok), where for each i, oi is the output of the i-th output neuron in the network when x is used as the network’s input. The neural network is doing well at identifying the class of x when the classification vector c(x) is similar to the output vector o(x).Fortunately, we have been using a method of comparing two vectors throughout this book. The RMSD between c(x) and o(x) measures how well the network classified data point x, with a value close to 0 representing a good fit. We can obtain a good measure of how well a neural network with given weight and bias parameters performs on a training set as a whole by taking the average RMSD between classification and output vectors over every element in the training set. We therefore would like to choose the biases and input weights for the neural network that minimize this average RMSD for all objects in the training set.Once we have chosen a collection of bias and weight parameters for our network that perform well on the training set, we then assess how well these parameters perform on the test set. To this end, we can insert the feature vector of each test set object x as input into the network and consult the output vector o(x). Whichever i maximizes oi for this output vector becomes the assigned class of x. 
We can then use the metrics introduced previously in this module for quantifying the quality of a classifier to determine how well the neural network performs at classifying objects from the test set.This discussion has assumed that we can easily determine the best choice of network parameters to produce a low mean RMSD between output and classification vectors for the training set. But how can we find this set of parameters in the first place?Exploring a neural network’s parameter spaceThe typical neural network contains anywhere from thousands to billions of biases and input weights. We can think of these parameters as forming the coordinates of a vector in a high-dimensional space. From the perspective of producing low mean RMSD between output and classification vectors over a collection of training data, the vast majority of the points in this space (i.e., choices of network parameters) are worthless. In this vast landscape, a tiny number of these parameter choices will provide good results on our training set; even with substantial computational resources, finding one of these points is daunting.The situation in which we find ourselves is remarkably similar to one we have encountered throughout this course, in which we need to explore a search space for some object that optimizes a function. We would therefore like to design a local search algorithm to explore a neural network’s parameter space.As with ab initio structure prediction, we could start with a random choice of parameters, make a variety of small changes to the parameter values to obtain a set of “nearby” parameter vectors, and update our current parameter vector to the parameter vector from this set that produces the greatest decrease in mean RMSD between output and classification vectors. We then continue this process of making small changes to the current parameter vector until this mean RMSD stops decreasing. This local search algorithm is similar to the most popular approach for determining parameters for a neural network, called gradient descent.STOP: What does a local minimum mean in the context of neural network parameter search?Just as we run ab initio structure prediction algorithms using many different initial protein conformations, we should run gradient descent for many different sets of randomly chosen initial parameters. In the end, we take the choice of parameters minimizing mean RMSD over all these trials.Note: If you find yourself interested in deep learning and would like to learn more, check out the excellent Neural Networks and Deep Learning online book by Michael Nielsen.Neural network pitfalls, Alphafold, and final reflectionsNeural networks are wildly popular, but they have their own issues. Because we have so much freedom for how the neural network is formed, it is challenging to know a priori how to design an “architecture” for how the neurons should be connected to each other for a given problem.Once we have decided on an architecture, the neural network has so many bias and weight parameters that even with access to a supercomputer, it may be difficult to find values for these parameters that perform even reasonably well on the training set; the situation of having parameters with high RMSD for the training set is called “underfitting”. 
Even if we build a neural network having low mean RMSD for the training set, the neural network may perform horribly on the test set, which is called “overfitting” and offers yet another instance of the curse of dimensionality.Despite these potential concerns with neural networks, they are starting to show promise of making significant progress in solving biological problems. AlphaFold, which we introduced when discussing protein folding, is powered by neural networks that contain 21 million parameters. Yet although AlphaFold has revolutionized the study of protein folding, just as many problems exist for which neural networks are struggling to make progress over existing methods. Biology largely remains, like the environment of a lonely bacterium, an untouched universe waiting to be explored.Thank you!If you are reading this, and you’ve made it through our entire course, thank you for joining us on this journey! We are grateful that you gave your time to us, and we wish you the best on your educational journey. Please don’t hesitate to contact us if you have any questions, feedback, or would like to leave us a testimonial; we would love to hear from you.Visit exercises            Habibzadeh M, Jannesari M, Rezaei Z, Baharvand H, Totonchi M. Automatic white blood cell classification using pre-trained deep learning models: ResNet and Inception. Proc. SPIE 10696, Tenth International Conference on Machine Vision (ICMV 2017), 1069612 (13 April 2018). Available online &#8617;              McCulloch WS, Pitts W 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5: 115–133. Available online &#8617;              Rosenblatt F 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65(6): 386. &#8617;      "
     
   } ,
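As an illustration of the definitions above, here is a minimal Python sketch of an artificial neuron with a logistic activation, together with the RMSD comparison used to score output vectors against classification vectors. The function names (`logistic`, `neuron_output`, `rmsd`) are hypothetical helpers, not code from the course.

```python
import math

def logistic(x):
    # f(x) = 1 / (1 + e^(-x)): very negative inputs map near 0, very positive near 1.
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias, activation=logistic):
    # Weighted sum of the inputs, minus the bias, passed through the activation.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum - bias)

def rmsd(u, v):
    # Root-mean-square deviation between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

# A step activation recovers the perceptron; equal unit weights recover the
# McCulloch-Pitts neuron, so this computes x1 AND x2 with b = 2:
step = lambda s: 1.0 if s >= 0 else 0.0
print(neuron_output([1, 1], [1, 1], 2, step))  # 1.0 (fires)
print(neuron_output([0, 1], [1, 1], 2, step))  # 0.0 (does not fire)

# Comparing an output vector against a ground-truth classification vector:
print(rmsd([0.9, 0.1, 0.05], [1, 0, 0]))  # close to 0, i.e., a good fit
```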
  
   {
     
        "title"    : "Conclusion: Turing Patterns are Fine-Tuned",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/conclusion",
        "date"     : "",
        "content"  : "In both the particle-based and automaton model for Turing patterns, we observed that the model is fine-tuned, meaning that very slight changes in parameter values can lead to significant changes in the system. These changes could convert spots to stripes, or they could influence how clearly defined the boundaries of the Turing patterns are.The figure below shows how the Turing patterns produced by the Gray-Scott model change as the kill and feed rates vary. The square at position (x, y) shows the pattern obtained as the result of a Gray-Scott simulation with kill rate x and feed rate y. Notice how much the patterns change! You may like to tweak the parameters of the Gray-Scott simulation from the previous lesson to see if you can reproduce these differing patterns.Changing kill (x-axis) and feed (y-axis) parameters greatly affects the Turing patterns obtained in the Gray-Scott model. Each small square shows the patterns obtained from a given choice of feed and kill rate.  Note that many choices of parameters do not produce Turing patterns, which only result from a narrow “sweet spot” band of parameter choices. Image courtesy: Robert Munafo.1Later in this course, we will see an example of a biological system that is the opposite of fine-tuned. In a robust system, perturbations such as parameter variations do not lead to substantive changes in the ultimate behavior of the system.  Robustness is vital for processes, like your heartbeat, that must be resilient to small environmental changes.It turns out that although Turing’s work offers a compelling argument for how zebras might have gotten their stripes, the cellular mechanism causing these stripes to form remains unidentified. However, researchers have shown that the skin of zebrafish does exhibit Turing patterns because two types of pigment cells serve as “particles” following a reaction-diffusion model much like the one we presented in this prologue.2Finally, take a look at the following two photos of giant pufferfish.34 Genetically, these fish are practically identical, but their skin patterns are very different. What may seem like a drastic change from spots to stripes is likely attributable to a small change of parameters in a fine-tuned biological system that, like much of life, is powered by randomness.                                                                        Two similar pufferfish with very different skin Turing patterns. (Left) A juvenile Mbu pufferfish with spotted skin. (Right) An adult Mbu pufferfish with striped skin.  A final noteThank you for making it this far! We hope that you are enjoying the course. You can visit the exercises for this module or skip ahead to the next module by clicking on the appropriate button below. We also ask that you complete the course survey if you have not done so already.Visit exercisesNext module            “Reaction-Diffusion by the Gray-Scott Model: Pearson’s Parametrization” © 1996-2020 Robert P. Munafo https://mrob.com/pub/comp/xmorphia/index.html &#8617;              Nakamasu, A., Takahashi, G., Kanbe, A., &amp; Kondo, S. (2009). Interactions between zebrafish pigment cells responsible for the generation of Turing patterns. Proceedings of the National Academy of Sciences of the United States of America, 106(21), 8429–8434. 
https://doi.org/10.1073/pnas.0808622106 &#8617;              NSG Coghlan, 2006 Creative Commons Attribution-Share Alike 3.0 Unported &#8617;              Chiswick Chap, 20 February 2012, Creative Commons Attribution-Share Alike 3.0 Unported &#8617;      "
     
   } ,
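For readers who want to reproduce a sweep like the figure above, the following sketch runs a standard Gray-Scott reaction-diffusion simulation for a few feed/kill pairs. The diffusion rates, grid size, step count, and seed placement are illustrative defaults, not the exact settings behind the figure.

```python
import numpy as np

def gray_scott(feed, kill, n=128, steps=10000, Du=0.16, Dv=0.08, dt=1.0):
    # U is the "prey" chemical and V the "predator" in the Gray-Scott model.
    U = np.ones((n, n))
    V = np.zeros((n, n))
    V[n//2-5:n//2+5, n//2-5:n//2+5] = 1.0  # small central seed of V

    def laplacian(Z):
        # Five-point stencil with wraparound (toroidal) boundaries.
        return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
                np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

    for _ in range(steps):
        uvv = U * V * V
        U += dt * (Du * laplacian(U) - uvv + feed * (1 - U))
        V += dt * (Dv * laplacian(V) + uvv - (feed + kill) * V)
    return V  # visualize to see spots, stripes, or no pattern at all

# Sweep a few points of the fine-tuned parameter space:
patterns = {(f, k): gray_scott(f, k)
            for f in (0.03, 0.04, 0.05) for k in (0.055, 0.060, 0.065)}
```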
  
   {
     
        "title"    : "Conclusion: The Beauty of *E. coli*&#39;s Robust Randomized Exploration Algorithm",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/conclusion",
        "date"     : "",
        "content"  : "Two randomized exploration strategiesIn the prologue, we saw that a particle taking a collection of n unit steps in random directions will wind up on average a distance proportional to \(\sqrt{n}\) units away from its starting position. We now will compare such a random walk against a modified algorithm that emulates the  behavior of E. coli by changing the length of a step (i.e., how long the bacterium tumbles) based on the relative change in background attractant concentration.We will represent a bacterium as a particle traveling in two-dimensional space. Units of distance will be measured in µm; recall from the introduction that a bacterium can cover 20 µm in a second during an uninterrupted run. The bacterium will start at the origin (0, 0).We will use L(x, y) to denote the ligand concentration at (x, y) and establish a point (called the goal) at which L(x, y) is maximized. We will place the goal at (1500, 1500), so that the bacterium must travel a significant distance from the origin to reach the goal.We would like the ligand concentration L(x, y) to decrease exponentially the farther we travel from the goal. We therefore set L(x, y) = 100 · 106 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal, which in this case is 1500√2  2121 µm. At the origin, the attractant concentration is equal to 100, and at the goal, the attractant concentration is equal to 100,000,000.STOP: How can we quantify how well a bacterium has done at finding the attractant?We are comparing two different cellular behaviors, and so in the spirit of Module 1, we will simulate many random walks of a particle following each of the two strategies, described in what follows. (The total time needed by our simulation should be large enough to allow the bacterium to have enough time to reach the goal.) For each strategy, we will then measure how far on average a bacterium with each strategy is from the goal at the end of the simulation.Strategy 1: Standard random walkTo model a particle following an “unintelligent” random walk strategy, we first select a random direction of movement along with a duration of tumble. The angle of reorientation is a random number selected uniformly between  and 360°. The duration of each tumble is a “wait time” of sorts and follows an exponential distribution with experimentally verified mean equal to 0.1 seconds1. As the result of a tumble, the cell only changes its orientation, not its position.We then select a random duration to run and let the bacterium run in that direction for the specified amount of time. The duration of each run follows an exponential distribution with mean equal to the experimentally verified value of 1 second.We then iterate the two steps of tumbling and running until the total time allocated for the simulation has elapsed.In the following tutorial, we simulate this naive strategy using a Jupyter notebook that will also help us visualize the results of the simulation.Visit tutorialStrategy 2: Chemotactic random walkIn our second strategy, which we call the “chemotactic strategy”, we mimic the real response of E. coli to its environment based on what we have learned about chemotaxis throughout this module. 
The simulated bacterium will still follow a run and tumble model, but the duration of each run, which is a function of the bacterium’s tumbling frequency, will depend on the relative change in attractant concentration that it detects.To ensure a mathematically controlled comparison, we will use the same approach for sampling the duration of a tumble and the direction of a run as in the first strategy.We have seen in this module that it takes E. coli about half a second to respond to a change in attractant concentration. We use tresponse to denote this “response time”; to produce a reasonable model of chemotaxis, we will check the attractant concentration of a running particle at the particle’s current location every tresponse seconds.We will then measure the percentage difference between the attractant concentration L(x, y) at the cell’s current point and the attractant concentration at the cell’s previous point, tresponse in the past; we denote this difference as Δ[L]. If Δ[L] is equal to zero, then the probability of a tumble in the next tresponse seconds should be the same as the likelihood of a tumble in the first strategy over the same time period. If Δ[L] is positive, then the cell is moving up the attractant gradient, and so the probability of a tumble should be lower than it was in strategy 1; if Δ[L] is negative, then the probability of a tumble should be greater than it was in strategy 1.To model the relationship between the likelihood of a tumble and the value of Δ[L], we will let t0 denote the mean background run duration, which in the first strategy was equal to one second. We would like to use a simple formula for the expected run duration, such as t0 · (1 + 10 · Δ[L]).Unfortunately, there are two issues with this formula. First, if Δ[L] is less than -0.1, then the run duration could be negative. Second, if Δ[L] is large, then the bacterium will run for so long that it could reach the goal and run past it.To fix the first issue, we will first take the maximum of t0 · (1 + 10 · Δ[L]) and some small positive number c (we will use c equal to 0.000001). As for the second issue, we will then take the minimum of the resulting expression and 4 · t0. The final value, μ = min(max(t0 · (1 + 10 · Δ[L]), c), 4 · t0), becomes the mean run duration of a bacterium based on the recent relative change in concentration.STOP: What is the mean run duration μ when Δ[L] is equal to zero? Is this what we would hope?As with the first strategy, our simulated cell will alternate between tumbling and running in a random direction until the total time devoted to the simulation has elapsed. The only difference in the second strategy is that we will measure the percentage change in concentration Δ[L] between a cell’s current point and its previous point every tresponse seconds. After determining a mean run time μ according to the expression above, we will sample a random number p from an exponential distribution with mean μ, and the cell will tumble after p seconds if p is smaller than tresponse. (This sampling rule is sketched in code after this entry.)In the following tutorial, we will adapt the Jupyter notebook that we built in the previous tutorial to simulate this second strategy and run it many times, taking the average of the simulated bacteria’s distance to the goal.Visit tutorialComparing the effectiveness of our two random walk strategiesThe following figure visualizes the trajectories of three cells over 500 seconds using strategy 1 (left) and strategy 2 (right) with a default mean run duration t0 of one second. 
Unlike the cells following strategy 1, the cells following strategy 2 quickly home in on the goal and remain near it.Three sample trajectories for the standard random walk strategy (left) and chemotactic random walk strategy (right). Redder regions correspond to higher concentrations of ligand, with a goal having maximum concentration at the point (1500, 1500), which is indicated with a blue square. Each particle’s walk is colored from darker to lighter colors across the time frame of its trajectory.We should be wary of such a small sample size. To confirm that what we observed in these trajectories is true in general, we will compare the two strategies over many simulations. The following figure visualizes the particle’s average distance to the goal over 500 simulations for both strategies and confirms our previous observation that strategy 2 is effective at guiding the simulated particle to the goal. And yet the direction of travel in this strategy is random, so why would this strategy be so successful?Distance to the goal plotted over time for 500 simulated particles following the standard random walk strategy (pink) and the chemotactic random walk strategy (green). The dark lines indicate the average distance over all simulations, and the shaded area around each line represents one standard deviation from the average.The chemotactic strategy works because of a “rubber band” effect. If the bacterium is traveling down an attractant gradient (i.e., away from an attractant), then it is not allowed to travel very far in a single step before it is forced to tumble. If an increase of attractant is detected, however, then the cell can travel farther in a single direction before tumbling. On average, this effect serves to pull the bacterium in the direction of increasing attractant, even though the directions in which it travels are random.A tiny change to a simple, unsuccessful randomized algorithm can therefore produce an elegant approach for exploring an unknown environment. But we left one more question unanswered: why is the default frequency of one tumble per second stable across a wide range of bacteria? To address this question, we will see how changing t0, the default time for a run step in the absence of change in attractant concentration, affects the ability of a simulated bacterium following strategy 2 to reach the goal. You may like to adjust the value of t0 in the previous tutorial yourself, or follow the tutorial below.Visit tutorialWhy is background tumbling frequency constant across bacterial species?The following figures show three trajectories for a few different values of t0 and a simulation that lasts for 800 seconds. 
First, we set t0 equal to 0.2 seconds and see that the simulated bacteria are not able to walk far enough in a single step to head toward the goal.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 0.2 seconds.If we increase t0 to 5.0 seconds, then cells can run for so long that they may run past the goal without being able to brake by tumbling.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 5.0 seconds.When we set t0 equal to 1.0, the figure below shows a “Goldilocks” effect in which the simulated bacterium can run for long enough at a time to head quickly toward the goal, and it tumbles frequently enough to keep it near the goal.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 1.0 seconds.The figure below visualizes average particle distance to the goal over time for 500 particles using a variety of choices of t0. It confirms that tumbling every second by default is “just right” for finding an attractant.Average distance to the goal over time for 500 cells. Each colored line indicates the average distance to the goal over time for a different value of t0; the shaded area represents one standard deviation.Below, we reproduce the video from earlier in this module showing E. coli moving towards a sugar crystal. This video shows that the behavior of real E. coli is similar to that of our simulated bacteria. Bacteria generally move towards the crystal and then remain close to it; some bacteria run by the crystal, but they turn around to move toward the crystal again.      Bacteria are even smarter than we thoughtIf you closely examine the video above, then you may be curious about the way that bacteria turn around and head back toward the attractant. When they reorient, their behavior appears more intelligent than simply walking in a random direction. As is often true in biology, the reality of the system that we are studying turns out to be more complex than we might at first imagine.The direction of bacterial reorientation is not completely random, but rather follows a normal distribution with mean of 68° and standard deviation of 36°2. That is, the bacterium typically does not make as drastic a change to its orientation as it would in a pure random walk, which would have an average change in orientation of 90°.Furthermore, the direction of the bacterium’s reorientation also depends on whether the cell is traveling in the correct direction.3 If the bacterium is moving up an attractant gradient, then it makes smaller changes in its reorientation angle, a feature that helps the cell continue moving straight if it is traveling in the direction of an attractant.For that matter, the run-and-tumble model of E. coli only represents one way for bacteria to explore their surroundings. Microorganisms have many different sizes, shapes, and metabolisms, and they live in a very diverse range of environments, which has produced several other exploration strategies.4We are fortunate to have this wealth of research on chemotaxis in E. coli, which may be the single most studied biological system from the perspective of understanding how chemical reactions produce emergent behavior. 
However, for the study of most biological systems, a clear thread connecting a reductionist view of the system to that system’s holistic behavior remains a dream. (For example: how can your thoughts while reading this parenthetical aside be distilled into the firings of individual neurons?)  Fortunately, we can be confident that uncovering the underlying mechanisms of biological systems will continue to inspire the work of biological modelers for many years.Visit exercisesSkip to next module            Saragosti J, Silberzan P, Buguin A. 2012. Modeling E. coli tumbles by rotational diffusion. Implications for chemotaxis. PLoS One 7(4):e35412. Available online &#8617;              Berg HC, Brown DA. 1972. Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature. Available online &#8617;              Saragosti J, Calvez V, Bournaveas N, Perthame B, Buguin A, Silberzan P. 2011. Directional persistence of chemotactic bacteria in a traveling concentration wave. PNAS. Available online &#8617;              Mitchell JG, Kogure K. 2005. Bacterial motility: links to the environment and a driving force for microbial physics. FEMS Microbiol Ecol 55:3-16. Available online &#8617;      "
     
   } ,
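The run-duration rule at the heart of the chemotactic strategy is easy to state in code. Below is a minimal sketch of the clamped mean μ and the exponential sampling described above; the helper names (`run_duration_mean`, `tumbles_within_response_time`) are illustrative.

```python
import random

def run_duration_mean(delta_L, t0=1.0, c=1e-6):
    # mu = min(max(t0 * (1 + 10 * delta_L), c), 4 * t0): clamped below by a
    # tiny positive constant and above by four background run durations.
    return min(max(t0 * (1 + 10 * delta_L), c), 4 * t0)

def tumbles_within_response_time(delta_L, t_response=0.5, t0=1.0):
    # Sample a run length p from an exponential distribution with mean mu;
    # the cell tumbles in the next t_response seconds if p < t_response.
    mu = run_duration_mean(delta_L, t0)
    return random.expovariate(1.0 / mu) < t_response

print(run_duration_mean(0.0))   # 1.0: no change in attractant, background runs
print(run_duration_mean(0.5))   # 4.0: clamped at 4 * t0 when moving up the gradient
print(run_duration_mean(-0.2))  # 1e-06: clamped at c, an almost immediate tumble
```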
  
   {
     
        "title"    : "Part 1 Conclusion: Protein Structure Prediction is Solved?",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/conclusion_part_1",
        "date"     : "",
        "content"  : "Protein structure prediction is an old problem. In 1967, the Soviets founded an entire research insitute dedicated to solving “the protein problem”; this institute still lives on today. Despite the difficulty of protein structure prediction, gradual algorithmic improvements and increasing computational resources have led biologists around the world to wish for the day when they could consider protein structure prediction to be solved.That day has come. Kind of.Every two years since 1994, a global contest called Critical Assessment of protein Structure Prediction (CASP) has allowed modelers to test their protein structure prediction algorithms against each other. The contest organizers compile a secret collection of experimentally verified protein structures and then run all submitted algorithms against these proteins.In 2020, the 14th iteration of this contest (CASP14) was won in a landslide. The second version of AlphaFold,1 a DeepMind project, vastly outperformed the world’s foremost structure prediction approaches.The algorithm powering AlphaFold is an extremely involved method based on deep learning, a topic that we will discuss in this work’s final module. If you’re interested in learning more about this algorithm, consult the AlphaFold website or this excellent blog post by Mohammed al Quraishi: https://bit.ly/39Mnym3.Instead of using RMSD, CASP scores a predicted structure against a known structure using the global distance test (GDT). This test first asks, “How many corresponding alpha carbons are close to each other in the two structures?” To answer this question, we take the percentage of corresponding alpha carbon positions whose distance from each other is at most equal to some threshold t. The GDT score averages the percentages obtained when t is equal to each of 1, 2, 4, and 8 angstroms. A GDT score of 90% is considered good, and a GDT score of 95% is considered excellent (i.e., comparable to minor errors resulting from experimentation) 2.We will show a few plots to illustrate the decisiveness of AlphaFold’s CASP victory. The first graph, which is shown in the figure below, compares the GDT scores of AlphaFold against the second-place algorithm (a product of David Baker’s laboratory, which developed the Robetta and Rosetta@Home software).A plot of GDT scores for the AlphaFold2 (blue) and Baker lab (orange) submissions over all proteins in the CASP14 contest. AlphaFold2 finished first in CASP14, and Baker lab finished second. Image courtesy: Mohammed al Quraishi.We can appreciate the size of the margin of victory in the above figure if we compare it against the difference between the second and third place competitors, shown in the figure below.A plot of GDT scores for the Baker lab (blue) and Zhang lab (orange) submissions for all proteins in the CASP14 contest. Baker lab finished second in CASP14, and Zhang lab finished third. Image courtesy: Mohammed al Quraishi.For each protein in the CASP14 contest, we can also compute each algorithm’s z-score, defined as the number of standard deviations that the algorithm’s GDT score falls from the mean GDT score over all competitors. For example, a z-score of 1.4 would imply that the approach performed 1.4 standard deviations above the mean, and a z-score of -0.9 would imply that the approach performed 0.9 standard deviations below the mean.Summing all of an algorithm’s positive z-scores gives a reasonable metric for the relative quality of an algorithm compared to its competitors. 
If this sum of z-scores is large, then the algorithm performed significantly above average on the prediction of some proteins. The figure below shows the sum of z-scores for all CASP14 participants and reiterates the margin of AlphaFold’s victory, since its sum of z-scores is twice that of the second place algorithm.A bar chart plotting the sum of z-scores for every entrant in the CASP14 contest. AlphaFold2 is shown on the far left; its sum of z-scores is over double that of the second-place submission. Source: https://predictioncenter.org/casp14/zscores_final.cgi.AlphaFold’s CASP14 triumph led some scientists and media outlets to declare that protein structure prediction had finally been solved3. Yet some critics remained skeptical.Although AlphaFold obtained an impressive median RMSD of 1.6 angstroms for its predicted proteins, about a third of these predictions have an RMSD over 2.0 angstroms, which we mentioned earlier is often used as a threshold for whether a predicted structure is reliable. For a given protein, we will not know in advance whether AlphaFold’s predicted structure is outside of this range unless we experimentally validate the protein’s structure.Furthermore, some experts have claimed that to be completely trustworthy for a sensitive application like designing drugs to target proteins implicated in diseases, the RMSD of predicted protein structures would need to be nearly an order of magnitude lower, i.e., closer to 0.2 angstroms.Finally, the AlphaFold algorithm is “trained” using a database of known protein structures, which makes it more likely to succeed if a protein is similar to a known structure. But the proteins with structures that are dissimilar to any known structure are the ones possessing some of the greatest scientific interest.Pronouncing protein structure prediction to be solved may be hasty, but we will likely never again see such a clear improvement to the state of the art for structure prediction. AlphaFold represents, perhaps, the final great innovation for a research problem that has puzzled biologists for over half a century.Thus ends our discussion of protein structure prediction, but we still have much more to say. In particular, when comparing two protein structures, we have relied only upon the RMSD between the vectorizations of these two structures after applying the Kabsch algorithm. But using a single statistic to represent the differences between two protein structures masks what those differences might be. Furthermore, proteins are not static objects; they bend and flex in the cell as they perform their tasks. We will therefore now transition to a second part of our discussion of protein analysis, in which we show additional methods used to compare proteins and apply these techniques to the validated structures of the SARS-CoV and SARS-CoV-2 spike proteins.Visit part 1 exercisesContinue to part 2: spike protein comparison            Jumper J et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589. Available online &#8617;              AlQuraishi M. 2020, December 8. AlphaFold2 @ CASP14: “It feels like one’s child has left.” Retrieved January 20, 2021, from https://bit.ly/39Mnym3 &#8617;              Service RF. 2020, November 30. ‘The game has changed.’ AI triumphs at solving protein structures. Science. doi:10.1126/science.abf9367 &#8617;      "
     
   } ,
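To make the GDT computation concrete, here is a minimal sketch that scores two superimposed lists of alpha-carbon coordinates; the function name and the input format are assumptions for illustration.

```python
def gdt_score(predicted, experimental, thresholds=(1.0, 2.0, 4.0, 8.0)):
    # predicted and experimental are equal-length lists of alpha-carbon
    # (x, y, z) coordinates, assumed already superimposed (e.g., with the
    # Kabsch algorithm). For each threshold t, take the percentage of
    # corresponding positions within t angstroms of each other, then
    # average those percentages over the four thresholds.
    def distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    distances = [distance(p, q) for p, q in zip(predicted, experimental)]
    percentages = [100.0 * sum(d <= t for d in distances) / len(distances)
                   for t in thresholds]
    return sum(percentages) / len(percentages)
```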
  
   {
     
        "title"    : "Part 2 Conclusion: Bamboo Shoots After the Rain",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/conclusion_part_2",
        "date"     : "",
        "content"  : "This chapter has been quite a journey. We began with a discussion of the fundamental problem of determining a protein’s structure. Because experimental methods for identifying protein structure are costly and time consuming, we transitioned to cover algorithmic approaches that do a good job of predicting a protein’s structure from its sequence of amino acids.We then discussed how to compare protein structures, with a lengthy case study comparing the SARS-CoV and SARS-CoV-2 spike proteins. The problem of quantifying the difference between two structures is challenging, and we established both global and local structure comparison metrics. We applied these approaches to isolate three candidate regions of interest of the SARS-CoV-2 spike protein that differ from the SARS-CoV spike protein when complexed with the ACE2 enzyme, and we quantified the binding of this complex using a localized energy function.We concluded with a study of molecular dynamics. If we hope to fully understand a protein’s function, then we need to know how it flexes and bends within its environment, sometimes in order to interact with other molecules.Despite covering a great deal of ground, we have left just as much unstudied. For one example, the surface of viruses and host cells are “fuzzy” because they are covered by structures called glycans, or numerous monosaccharides linked together. SARS-CoV and SARS-CoV-2 have a “glycan shield”, in which glycosylation of surface antigens allows the virus to hide from antibody detection. Researchers have found that the SARS-CoV-2 spike protein is heavily glycosylated, shielding around 40% of the protein from antibody recognition1.Finally, although the study of proteins is in some sense the first major area of biological research to become influenced by computational approaches, we hope that this chapter has made it clear that the analysis of protein structure and dynamics is booming. With the COVID-19 pandemic reiterating the importance of biomedical research, the dawn of Alphafold showing the power of supercomputing to solve age-old problems, and the promise of computational modeling to discover the medications of the future, new companies and organizations are rising up to study proteins like bamboo shoots after the rain.Thus concludes our discussion of protein analysis. In the course’s final module, we will turn our attention to a very different type of problem. To fight a virus like SARS-CoV-2, your body employs a cavalry of white blood cells. Maintaining healthy levels of these cells is vital to a strong immune system, and blood reports run counts of these cells to ensure they are within normal ranges. Can we teach a computer to run this analysis automatically?Visit exercisesNext moduleNote: Although we have covered a great deal in this chapter, there is still much more to say about SARS-CoV-2. What happens after the spike protein binds to ACE2? How does the virus enter the cell and replicate? How does it fight our immune systems, and how should we design a vaccine to fight back? If you are interested in an online course covering some of these questions, then check out the free online course SARS Wars: A New Hope by Christopher Langmead.            Grant, O. C., Montgomery, D., Ito, K., &amp; Woods, R. J. Analysis of the SARS-CoV-2 spike protein glycan shield: implications for immune recognition. bioRxiv : the preprint server for biology, 2020.04.07.030445. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7217288/ &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Contact Us",
        "category" : "",
        "tags"     : "",
        "url"      : "/contact/",
        "date"     : "",
        "content"  : "Work on this project is ongoing. If you have any questions about the project, please use the form below. To report typos or bugs, please use the Disqus comments on the page corresponding to the issue.We would be happy to hear from you if you are a learner who is interested in providing a testimonial about how this course has been useful to you.If you are an instructor who is interested in adopting this course in your own teaching, whether in full or in individual pieces, please let us know as we are forming a network of instructors adopting this course.Please do not hesitate to contact us. We look forward to hearing from you!Contact Form      First Name (optional)        Last Name (optional)        Email Address        Your Message (optional)            "
     
   } ,
  
   {
     
        "title"    : "A Coarse-Grained Model of Particle Diffusion",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/diffusion_automaton",
        "date"     : "",
        "content"  : "A coarse-grained model of single particle diffusionPart of a modeler’s job is to find simple models that capture the essence of a system while running quickly and scaling well to larger inputs.Our model consumes a huge amount of computational resources because it must track the movements of hundreds of thousands of individual particles. Our goal is to build a “coarse-grained” model that will allow us to witness Turing patterns emerge without the computational overhead required to track individual particles.We will grid off a two-dimensional plane into blocks and store only the concentration of each type of particle found within each block.  To simplify things further, we will assume that the concentration of a particle in each block is represented by a decimal number (sometimes representing a percentage concentration) as opposed to counting individual particles.We will begin with an example of the diffusion of only A particles; we will later add B particles as well as reactions to our model. Say that the particles have concentration equal to 1 in the central cell of the grid and 0 everywhere else, as shown below.A 5 x 5 grid showing hypothetical initial concentrations of A particles. Cells are labeled by decimal numbers representing their concentration of A particles. The central cell has maximum concentration, and no particles are contained in any other cell.We will now update the grid of cells after one time step to mimic particle diffusion. To do so, we will spread out the concentration of particles in each square to its eight neighbors. For example, we could assume that 20% of the current cell’s concentration diffuses to each of its four adjacent neighbors, and that 5% of the cell’s concentration diffuses to each of its four diagonal neighbors. Because the central square in our ongoing example is the only cell with nonzero concentration, the updated concentrations after a single time step are shown in the following figure.A grid showing an update to the system in the previous figure after diffusion of particles after a single time step.Note: The sum of the values in both grids in the figure above is equal to 1, which ensures the conservation of total mass in the system.After an additional time step, the particles continue to diffuse outward. For example, each diagonal neighbor of the central cell in the above figure, which has a concentration of 0.05 after one time step, will lose all of its particles in the following step. Each of these cells will also gain 20% of the particles from two of its adjacent neighbors, along with 5% of the particles from the central square (whose concentration is zero). This makes the updated concentration of this cell after two time steps equal to 2(0.2)(0.2) + 0.05(0) = 0.08.The four cells directly adjacent to the central square, which have a concentration of 0.2 after one time step, will also gain particles from their neighbors. Each such cell will receive 20% of the particles from two of its adjacent neighbors and 5% of the particles from two of its diagonal neighbors, which have a concentration of 0.2. Therefore, the updated concentration of each of these cells after two time steps is 2(0.2)(0.05) + 2(0.05)(0.2) = 0.02 + 0.02 = 0.04.Finally, the central square has no particles after one step, but it will receive 20% of the particles from each of its four adjacent neighbors, as well as 5% of the particles from each of its four diagonal neighbors. 
As a result, the central square’s concentration after two time steps is 4(0.2)(0.2) + 4(0.05)(0.05) = 0.17.In summary, the central nine squares after two time steps are as shown in the following figure.A grid showing an update to the central nine squares of the diffusion system in the previous figure after an additional time step. The cells labeled “?” are left as an exercise for the reader.STOP: What should the values of the “?” cells be in the above figure?The coarse-grained model of particle diffusion that we have built is a variant of a cellular automaton, or a grid of cells in which we use fixed rules to update the status of a cell based on its current status and those of its neighbors. Cellular automata form a rich area of research applied to a wide variety of fields dating back to the middle of the 20th Century; if you are interested in learning more about them from the perspective of programming, then you might like to check out the Programming for Lovers project.Slowing down the diffusion rateThere is just one problem: our diffusion model is too volatile! The figure below shows the initial automaton as well as its updates after each of two time steps. In a true diffusion process, all of the particles would not rush out of the central square in a single time step, only for some of them to return in the next step.                                                                                                        A 5 x 5 cellular automaton model for diffusion of a single particle. (Left) The system contains a maximum concentration of particles in the central square. (Center) The system after one time step. (Right) The system after two time steps.  Our solution is to slow down the diffusion process by adding a parameter dA having values between 0 and 1 that represents the rate of diffusion of A particles. Instead of moving a cell’s entire concentration of particles to its neighbors in a single time step, we move only a fraction dA of them.Revisiting our original example, say that dA is equal to 0.2. After the first time step, only 20% of the central cell’s particles will be spread to its neighbors. Of these particles, 5% will be spread to each diagonal neighbor, and 20% will be spread to each adjacent neighbor. The figure below illustrates that after one time step, the central square has concentration equal to 0.8, its adjacent neighbors have concentration equal to 0.2dA = 0.04, and its diagonal neighbors have concentration equal to 0.05dA = 0.01.An updated grid of cells showing the concentration of A particles after one time step if dA = 0.2.Adding a second particle to our diffusion simulationWe now will add B particles to the simulation, which we assume also start with concentration equal to 1 in the central square and 0 elsewhere. Recall that B, our “predator” molecule, diffuses half as fast as A, the “prey” molecule. If we set the diffusion rate dB equal to 0.1, then our cells after a time step will be updated as shown in the figure below. This figure represents the concentration of the two particles in each cell as an ordered pair ([A], [B]).A figure showing cellular concentrations after one time step for two particles A and B that start at maximum concentration in the central square and diffuse at rates dA = 0.2 and dB = 0.1. Each cell is labeled by the ordered pair ([A], [B]).STOP: Update the cells in the above figure after another generation of diffusion. 
Use the diffusion rates dA = 0.2 and dB = 0.1.Visualizing particle concentrations in an automatonAs we scale up to a board that is hundreds of cells wide, listing the concentrations of our two particles in each cell will be difficult to analyze. Instead, we need some way to visualize the results of our diffusion simulation.First, we will consolidate the information stored in a cell about the concentrations of two particles into a single value. In particular, let a cell’s particle concentrations be denoted [A] and [B]. Then the single value [B]/([A] + [B]) is the ratio of the concentration of B particles to the total number of particles in the cell. This value ranges between 0 (no B particles present) and 1 (only B particles present).STOP: What should be the value of [B]/([A] + [B]) if both [A] and [B] are equal to zero?Next, we color each cell according to its value of [B]/([A] + [B]) using a color spectrum like those shown in the figure below. We will use the Spectral color map, meaning that if a cell has a value close to 0 (relatively few predators), then it will be colored red, while if it has a value close to the maximum value of [B]/([A] + [B]) (relatively many predators), then it will be colored dark blue.When we color each cell over many time steps, we can animate the automaton to see how it changes over time. We are now ready for the following tutorial, in which we will implement and visualize our diffusion automaton using a Jupyter notebook.Visit tutorialThe video below shows an animation of a 101 x 101 board with dA = 0.5 and dB = 0.25 that begins with [A] = 1 for all cells. All cells have [B] = 0 except for an 11 x 11 square in the middle of the grid, where [B] = 1. (There is nothing special about the dimensions of this central square.) Without looking at individual concentration values, this animation allows us to see immediately that the A particles are remaining in the corners, while a band of B particles expands outward from the center.Note: Particles technically “fall off” the sides of the board in the figure below, meaning that a given particle’s total concentration across all cells decreases over time. (A minimal diffusion-step sketch in code follows this entry.)Next lesson"
     
   } ,
  
   {
     
        "title"    : "Network Motifs Exercises",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/exercises",
        "date"     : "",
        "content"  : "A short introduction to statistical validationIn this chapter, we have largely appealed to intuition when claiming that the number of network structures that we observe is larger than what we would see due to random chance. For example, we pointed out that there are 130 loops in the E. coli transcription factor network, but we would only expect to see 2.42 such loops if the network were random.This argument hints at a foundational approach in statistics, which is to determine the likelihood that an observation would occur due to random chance; this likelihood is called a p-value. In this case, our observation is 130 loops, and the probability that this number of loops would have arisen in a random network of the same size is practically zero. The exact computation of this p-value is beyond the scope of our work here, but we will turn our attention to a different p-value that is easier to determine.We also mentioned that 95 of the 130 loops in the E. coli transcription factor network correspond to repression, which is another surprising fact, since in a random network, we would expect to see only about 65 loops that correspond to activation. We therefore hypothesize that negative autoregulation is on the whole more important to the cell than positive autoregulation.To compute a p-value associated with the frequency of negative autoregulation, we ask, “if the choice of a loop corresponding to activation or repession were random and equal, then what is the likelihood that out of 130 loops, 95 (or more) correspond to activation?” This question is biological but is equivalent to a question about flipping coins: “if we flip a coin 130 times, then what is the likelihood that the coin will come up heads 95 or more times?”Say that we observe four coin flips. Then there are 16 equally likely outcomes, shown in the table below.The 16 possible sequences resulting from flipping a coin four times.Each outcome is equally likely, and so we can compute the probability of observing k heads in the four flips by counting how many of the 16 cases have k heads:\[\begin{align*}\mathrm{Pr}(\text{0 heads}) &amp;= 1/16\,;\\\mathrm{Pr}(\text{1 heads}) &amp; = 4/16 = 1/4\,;\\\mathrm{Pr}(\text{2 heads}) &amp; = 6/16 = 3/8\,;\\\mathrm{Pr}(\text{3 heads}) &amp; = 4/16 = 1/4\,;\\\mathrm{Pr}(\text{4 heads}) &amp; = 1/16\,.\end{align*}\]To generalize these probabilities, if we flip a coin n times, then the probability of obtaining k heads is equal to\[\dbinom{n}{k} \cdot \left(\dfrac{1}{2}\right)^n\]where \(\binom{n}{k}\) is called the “combination statistic” and is equal to n!/(k! · (n  k)!).If we instead wish to compute the probability of obtaining at least k heads, then we simply need to sum the above expression for all values larger than k:\[\displaystyle\sum\limits_{j=k}^n{\dbinom{n}{j} \cdot \left(\dfrac{1}{2}\right)^n}\]This expression gives us our desired p-value.Exercise: When n = 130 and k = 95, compute the p-value given by the above expression. Is it likely that random chance could have produced 95 out of 130 loops in the transcription factor network that correspond to repression?Counting feedforward loopsExercise: Modify the Jupyter notebook provided in the tutorial on loops to count the number of feed-forward loops in the transcription factor network for E. 
coli.There are eight types of feed-forward loops based on the eight different ways in which we can label the edges in the network with a “+” or a “-“ based on activation or repression.The eight types of feed-forward loops.1Exercise: Modify the Jupyter notebook to count the number of loops of each type in the E. coli transcription factor network.Exercise: How many feed-forward loops would you expect to see in a random network having the same number of nodes as the E. coli transcription factor network? How does this compare to your answers to the previous two questions?Negative autoregulationExercise: One way for the cell to apply stronger “brakes” to the activation of a transcription factor would be to simply increase the degradation rate of that transcription factor, rather than implement negative autoregulation. From an evolutionary perspective, why do you think that the cell doesn’t do this?Recall that we used the reaction reaction X  X + Y to represent activation. We then built a model in a tutorial to run a mathematically controlled comparison between two simulated cells, one having only this reaction, and the other also having negative autoregulation, which we represented using the reaction Y + Y  Y.Exercise: Multiply the rate of the reaction X  X + Y by a factor of 100 in the cell having only simple regulation, and plot the concentration of Y in both cells (the updated table containing reactants, products, and reaction rate constants is found below). By approximately what factor do you need to increase the rate of this reaction in the cell that includes negative autoregulation so that the steady-state concentration of Y remains the same in both cells?            Reactants      Products      Forward Rate                  X1’      X1’ + Y1’      4e4              X2’      X2’ + Y2’      4e2              Y1’      NULL      4e2              Y2’      NULL      4e2              Y2’ + Y2’      Y2’      4e2      Positive autoregulationAlthough most of the autoregulating E. coli transcription factors exhibit negative autoregulation, 35 of these transcription factors autoregulate positively, meaning that the transcription factor activates its own regulation. This network motif exists in processes in which the cell needs a cell to be produced at a slower, more precise rate than it would under normal activation. This occurs in some genes related to development, when gene expression must be carefully controlled.Exercise: Design and implement a reaction-diffusion model to run a mathematically-controlled simulation comparing the positive autoregulation of a transcription factor Y against normal activation of Y by another transcription factor X. Plot the concentration of Y over time in the two simulations.Replicating the module’s conclusions with well-mixed simulationsRecall that a repressilator is a network motif in which X inhibits Y, which inhibits Z, which in turn inhibits Z. In a tutorial, we showed how to use a well-mixed simulation to build the repressilator, which we then perturbed.\tutorial[motifs/tutorial_perturb]\Exercise: Build well-mixed simulations to replicate the other network motif tutorials presented in this module.Next module            Image adapted from Mangan, S., &amp; Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences of the United States of America, 100(21), 11980–11985. https://doi.org/10.1073/pnas.2133841100 &#8617;      "
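As a sanity check, the p-value above can be evaluated directly in a few lines of Python (a sketch, not part of the module’s notebooks):

from math import comb

# Pr(at least k heads in n flips of a fair coin).
n, k = 130, 95
p_value = sum(comb(n, j) for j in range(k, n + 1)) / 2**n
print(p_value)  # a vanishingly small probability, far below any usual threshold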
     
   } ,
  
   {
     
        "title"    : "Classification and Image Analysis Exercises",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/exercises",
        "date"     : "",
        "content"  : "Neural networks and logical connectivesOne of the strengths of neural networks is that their output can mimic, at least approximately, any function. In the case of classification, this means that if some function g(x1, …, xn) performs well at classifying data with n features, then some neural network exists that will replicate g: we just need to find the right set of network parameters.We say that a neural network with input variables x1, …, xn represents a function g(x1, …, xn) if for any choice of input variables x1, …, xn, the output of the neural network is equal to g(x1, …, xn). We provide two exercises on finding neural networks that represent relatively simple functions.Exercise: Consider the function g(x1, x2) for binary input variables x1 and x2 that outputs 0 when x1 and x2 are both equal to 1 and that outputs 1 for other choices of x1 and x2. (The function g is known as a “NAND gate”). Find a single perceptron that represents g.Exercise: Consider the function g(x1, x2) for binary input variables x1 and x2 that outputs 1 when x1  x2 and 0 when x1 = x2. (The function g is known as an “XOR gate”). It can be shown that no single perceptron represents g; find a neural network of perceptrons that represents g.A little fun with lost citiesExercise: Consider the three points x = (−8, 1), y = (7, 6), and z = (10, −2). Say that the distances from these points to some point w with unknown location are as follows: d(x,w) = 13; d(y,w) = 3; d(z,w) = 10. Where is w?More on the curse of dimensionalityIntuitively, we would like to have a large number of features in our data (i.e., a large dimension of n in each data point’s feature vector). Yet consider the figure below, which plots the petal width and length of only two iris flowers. It would be a horrible idea to extrapolate anything from the line connecting these two points, as it indicates that these two variables are inversely coordinated, which is the opposite of the true correlation that we found in the main text.Petal length (x-axis) plotted against petal width (y-axis) for two flowers in the iris flower dataset. Because of random chance and small sample size, these two flowers demonstrate an inverse correlation between petal length and width, the opposite of the true correlation found in the main text.This example provides another reason why we reduce the dimension of a dataset when the number of objects in our dataset is smaller than the number of features of each object. Furthermore, when fitting a d-dimensional hyperplane to a collection of data, we need to be careful with selecting too large of a value of d, especially if we do not have many data points.Exercise: What is the minimum number of points in three-dimensional space such that we cannot guarantee that there is some plane containing them all? Provide a conjecture as to the minimum number of points in n-dimensional space such that we cannot guarantee that there is some d-dimensional hyperplane containing them all.Irises, PCA, and feature selectionExercise: The iris flower dataset has four features. Apply PCA with d = 2 to reduce the dimension of this dataset. Then, apply k-NN with k equal to 1 and cross validation with f equal to 10 to the resulting vectors of reduced dimension to obtain a confusion matrix. What are the accuracy, recall, specificity, and precision? 
How do they compare against the results of using all four iris features that we found earlier?Exercise: Another way to reduce the dimension of a dataset is to eliminate features from the dataset. Apply k-NN with k equal to 1 and cross validation with f equal to 10 to the iris flower dataset using only the two features petal width and petal length. Then, run the same classifier on your own choice of two iris features to obtain a confusion matrix. How do the results compare against the result of the previous exercise (which used PCA instead of feature elimination) and those from using all four features?More classification of WBC imagesExercises coming soon!"
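The PCA exercise can be started with a minimal scikit-learn sketch like the following (assuming scikit-learn’s bundled copy of the iris dataset stands in for the course’s):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)    # reduce four features to d = 2

knn = KNeighborsClassifier(n_neighbors=1)    # k = 1
pred = cross_val_predict(knn, X2, y, cv=10)  # f = 10 folds of cross validation
print(confusion_matrix(y, pred))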
     
   } ,
  
   {
     
        "title"    : "Turing Pattern Exercises",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/exercises",
        "date"     : "",
        "content"  : "Solar photons and random walksPhotons are massless particles carrying electromagnetic radiation. In the sun’s core, photons are created as the result of nuclear fusion when two hydrogen atoms crash together and form a helium atom. The photons that are released have a great deal of kinetic energy, traveling at the speed of light (approximately 300,000,000 m/s). However, the atoms in the center of the sun are densely packed together, and so just like proteins in the cytoplasm, photons constantly bounce off atoms and follow random walks.The average distance that a photon travels between atom collisions is called the photon’s “mean free path”. The mean free path varies depending on the photon’s distance from the sun’s core, but conservative estimates of the average mean free path dictate that it is no more than 1 cm.Exercise: On average, what is the length of each step of a solar photon’s random walk? That is, based on a mean free path of 1 cm, what is the typical time between collisions for a given photon traveling at the speed of light?When you feel the warmth of sunshine, the photons colliding with your skin have traveled from the sun in just eight minutes. But how long did these photons take to reach the surface of the sun in the first place?Exercise: Use the Random Walk Theorem to approximate the number of steps that it will take for a photon to reach the sun’s corona from its center, given that the sun has a radius of 700,000 km. Then, use your solution to the previous exercise to estimate the expected time required for this journey.Practicing the cellular automaton model of diffusionEarlier in this module, we showed a grid of cells containing concentrations of two particles A and B that start with concentration equal to 1 in the central square and diffuse according to the rates dA = 0.2 and dB = 0.1. In a “STOP” question, we asked you to update this grid, reproduced below, after another time step of diffusion.A figure showing cellular concentrations after one time step for two particles A and B that start at maximum concentration in the central square and diffuse at rates dA = 0.2 and dB = 0.1. Each cell is labeled by the ordered pair ([A], [B]).Exercise: Instead of solely diffusing the particles, update the original grid (in which A and B have concentration equal to 1 in the central cell) for two time steps according to the Gray-Scott model. Use f = 0.03 and k = 0.1.Changing the predator-prey reactionBoth our particle simulator and the Gray-Scott model used a reaction A + 2B  3B to represent a predator-prey dynamics of sorts. But there is no reason a priori why we would have used this reaction; instead, we could have modeled the simpler reaction A + B  2B, in which a predator molecule collides with a prey molecule, and the prey molecule changes into a predator. Because this reaction only requires the collision of two particles, it would be more frequent than the reaction A + 2B  3B if all else is equal.Exercise: Adapt the particle-based simulation that we introduced earlier in a tutorial to replace the reaction A + B  3B with A + B  2B. Play around with the system’s reaction rate parameters; are you still able to generate Turing patterns?Exercise: How would the simulation from the Gray-Scott tutorial need to change to model the reaction A + B  2B instead of A + B  3B? 
Make the appropriate changes to the implementation of the Gray-Scott model; do you still observe Turing patterns?Changing Gray-Scott parametersRecall the figure below, which shows how changing feed and kill rates affect Turing patterns.Changing kill (x-axis) and feed (y-axis) parameters greatly affects the Turing patterns obtained in the Gray-Scott model. Each small square shows the patterns obtained from a given choice of feed and kill rate. Note that many choices of parameters do not produce Turing patterns, which only result from a narrow “sweet spot” band of parameter choices.Exercise: Try changing the diffusion rates in the Gray-Scott model to the values of dA = 0.1 and dB = 0.05. Do the same patterns result? What happens if we make the diffusion rates equal?When we diffuse particles in the automaton model (including the Gray-Scott model), we did not mention what happens at the boundary of the simulation because we were interested in the patterns that arise in the interior of the grid. The simulations shown in the figures in this chapter assume that the automaton is surrounded by a “buffer” of invisible cells that have zero concentration. In this way, the particles leaving the board simply flow off the edges.We could have used two additional methods to handle barriers. First, we could assume that particles flowing off one side of the board are added to the corresponding cells on the opposite side of the board. Second, we could assume that particles ricochet off the barriers,  so that any points that would flow off the side of the board instead remain in the current cell.In the Gray-Scott tutorial, this was handled by the setting the parameter boundary='fill' to signal.convolve2d(). The boundary parameter could also take the additional options boundary='wrap' or boundary='symm' to represent the above two cases, which can be combined with the \texttt{‘fillvalue’} option as documented here.Exercise: Adapt the simulation of the Gray-Scott model to handle each of these additional two cases. Do you see the same patterns arise?Next module"
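For reference, here is a minimal sketch of how the three boundary behaviors differ when calling scipy’s convolve2d; the kernel below is illustrative rather than the tutorial’s exact code.

import numpy as np
from scipy.signal import convolve2d

kernel = np.array([[0.05, 0.2, 0.05],
                   [0.2, -1.0, 0.2 ],
                   [0.05, 0.2, 0.05]])   # a Laplacian-style diffusion kernel
grid = np.random.rand(101, 101)

flow_off = convolve2d(grid, kernel, mode='same', boundary='fill', fillvalue=0)
wrap     = convolve2d(grid, kernel, mode='same', boundary='wrap')  # opposite side
ricochet = convolve2d(grid, kernel, mode='same', boundary='symm')  # reflect at walls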
     
   } ,
  
   {
     
        "title"    : "Exercises",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/exercises",
        "date"     : "",
        "content"  : "How does E. coli respond to repellents?Just as E. coli has receptors that bind to attractant ligands, it has other receptors that bind to repellent ligands. Attractant-ligand binding causes an increase in the autophosphorylation of CheA, but repellent-ligand binding causes a decrease in the autophosphorylation of CheA.Exercise: Based on what we have learned in this module about how E. coli and other bacteria act in the presence of an attractant, how do you think that bacteria respond in the presence of a repellent, and how do you think that the bacterium adjusts to relative changes of the repellent?We learned that E. coli is likely to run for longer when traveling up an attractant gradient, which in the long run means that it is able to find attractant sources despite running in random directions. For the same reason, E. coli is likely to run for longer when traveling down a repellent gradient.Exercise: Adapt the “chemotactic” random walk strategy that we implemented in a tutorial to handle the fact that bacteria sensing a relative decrease in repellent concentration will have longer runs before tumbling. Simulate this strategy for a collection of particles placed near a “goal” representing a repellent source. What is the average distance of the particles from the goal? How does it compare to the average distance to the goal for a collection of particles following a pure random walk?Traveling down an attractant gradient.A related question to how a bacterium responds to a repellent is how it behaves when traveling away from an attractant, i.e., down an attractant gradient. To model this situation, we will still use the function [L] = \(l_o \cdot e^{k \cdot t}\) from our tutorial modeling a bacterium traveling up an attractant gradient, but we will now assume that k is negative so that the concentration is decaying exponentially.Exercise: Adapt the gradient simulation to model the concentration of phosphorylated CheY over time for an exponentially decaying attractant concentration. How does the plot of phosphorylated CheY change as k gets more negative?What if E. coli has multiple attractant sources?Not only can E. coli sense both repellents and attractants, but it can detect more than one attractant gradient at the same time.  This function has a clear evolutionary purpose in a bacterial environment containing multiple dispersed food sources. We will now explore whether the chemotaxis mechanism allows cells to navigate through heterogeneous nutrient distributions.Exercise: Modify our model from the adaptation tutorial to reflect two types of receptor, each specific to its own ligand (call them A and B). Assume that we have 3500 receptor molecules of each type. (Hint: you will not need to have additional molecules in addition to L and T. Instead, specify additional states for the two molecules that we already have; for example L(t,Lig~A) should only bind with T(l,Lig~A). Don’t forget to update species as well!)In the previous exercise, the cell adapts to the presence of two different attractants at the same time. We now consider what will happen if we only add B molecules once the cell has already adapted to A molecules.Exercise: Change your model by assuming that after the cell adapts to 1,000,000 molecules of A, 1,000,000 molecules of B are added. Observe the concentration of phosphorylated CheY. Is the cell able to respond to the influx of B after adapting to the concentration of ligand A? 
Why do you think that the change in CheY phosphorylation is different from the scenario in which we release the two different ligands concurrently? (The hint for the previous exercise also applies to this exercise.)In the chemotactic walk tutorial, we used a concentration gradient that grew exponentially toward a single goal. Specifically, if L(x, y) was the concentration of ligand at (x, y), then we set L(x,y) = 100 · 106 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal (we used a goal of (1500, 1500)).To generalize this simulation to an environment with more than one attractant source, we will include another goal at (-1500, 1500). The new ligand concentration formula will be L(x, y) = 100 · 106 · (1-d1/D1) + 100 · 106 · (1-d2/D2), where d1 is the distance from (x, y) to the goal at (1500, 1500), d2 is the distance from (x, y) to the goal at (-1500, 1500), and D1 and D2 are the distances from the origin to the two respective goals.Exercise: Change the chemotactic walk simulation so that it includes the two goals, and visualize the trajectories of several particles using a background time between tumbles (t0) equal to one second. Are the particles able to find one of the goals? How long does it take them, and how does the result compare against the case of a single goal?Exercise: Vary the tumbling frequency according to the parameters given in the chemotactic walk tutorial to see how tumbling frequency influences the average distance of a cell to the closer of the two goals. As in the tutorial, run your simulation for 500 particles with the default time between tumbles (t0) equal to each of 0.2, 0.5, 1.0, 2.0 and 5.0 seconds.Changing the reorientation angle of E. coliIn the conclusion, we mentioned that when E. coli tumbles, the degree of reorientation is not uniformly random from  to 360°. Rather, research has shown that it follows a normal distribution with mean of 68° (1.19 radians) and standard deviation of 36° (0.63 radians).Exercise: Modify your model from the chemotactic walk tutorial to change the random uniform sampling to this “smarter” sampling. Compare the chemotactic walk strategy and this smarter strategy by calculating the mean and standard deviation of each cell’s distance to the goal for 500 simulated cells with the default time between tumbles (t0) equal to each of 0.2, 0.5, 1.0, 2.0, and 5.0. Do these simulated cells do a better job of finding the goal compared to those of the original chemotactic strategy?More recent research suggests that when a bacterium is moving up an attractant gradient, the degree of reorientation may be even smaller1. Do you think that such a reorientation strategy would improve a cell’s chemotaxis response?Exercise: Modify your model from the previous exercise so that if the cell has just made a move of increasing ligand concentration, then its mean reorientation angle is 0.1 radians smaller. Calculate the mean and standard deviation of each cell’s distance to the goal for 500 cells with the default time between tumbles (t0) equal to each of 0.2, 0.5, 1.0, 2.0, and 5.0. Do the cells find the goal faster than they did in the preceding exercise?Can’t get enough BioNetGen?As we have seen in this module, rule-based modeling is successful at simulating systems that involve a large number of species and particles but can be summarized with a small set of rules.Polymerization reactions offer another example of such a system. 
Polymerization is the process by which solitary monomer molecules combine into chains called polymers. Biological polymers are everywhere, from DNA (formed of monomer nucleotides) to proteins (formed of monomer amino acids) to lipids (formed of monomer fatty acids). For a nonbiological example, polyvinyl chloride (which lends its name to “PVC pipe”) is a polymer comprising many vinyl molymers.We would like to simulate the polymerization of copies of a monomer A to form a polymer AAAAAA…, where the length of the polymer is allowed to vary. When we simulate this process, we are curious what the distribution of the polymer lengths will be.We will write our polymer reaction as Am + An -&gt; Am+n, where Am denotes a polymer consisting of m copies of A. Using classical reactions, this single rule would require an infinite number of reactions; will rule-based modeling and BioNetGen come to our rescue?There are two sites on the monomer A that are involved in a polymerization reaction: the “head” and the “tail”. For two monomers to bind, we need the head on one monomer and the tail on another to both be free. The following BioNetGen model is taken from the BioNetGen tutorials.Create a new BioNetGen file and save it as polymers.bngl. We will have only one molecule type: A(h,t); the h and t labels indicate the “head” and “tail” binding sites, respectively. To model polymerization, we will need to represent four reaction rules:  initializing the series of polymerization reactions: two unbound copies of A form an initial dimer, or a polymer with just two monomers;  adding an unbound A to the “tail” of an existing polymer;  adding an existing polymer to the “tail” of an unbound A; and  adding an existing polymer to the “tail” of another polymer.To select any species that is bound at a component, we use the notation !+; for example, A(h!+,t) will select any A whose “head” is bound, whether it belongs to a chain of one or one million monomers.We will assume that the forward and reverse rates for each reaction occur at the same rate. For simplicity, we will set all forward and reverse reaction rates to be equal to 0.01.We will initialize our simulation with 1000 unbound A monomers and observe the formation of polymer chains of a few different lengths (1, 2, 3, 5, 10, and 20).  To do so, we can use an “observable” A == n to denote that a polymer contains n copies of A. We need to use Species instead of Molecules to select polymer patterns.begin species	A(h,t) 1000end speciesbegin observables	Species A1 A==1	Species A2 A==2	Species A3 A==3	Species A5 A==5	Species A10 A==10	Species A20 A==20	Species ALong A&gt;=30end observablesFor this model, the infinite number of possible interactions will slow down the Gillespie algorithm. For that reason, we will use an alternative to the Gillespie algorithm called network-free simulation, which tracks individual particles.After building the model, we can run our simulation with the following command (note that we do not need the generate_network() command):simulate({method=&gt;"nf", t_end=&gt;100, n_steps=&gt;1000})Exercise: Run the simulation. How do the concentrations of polymers vary according to the lengths of the polymers?Exercise: What happens to polymer concentrations as we change the polymer binding and dissociation rates? Does your observation reflect what you might expect?Next module            Saragosti J, Calvez V, Bournaveas, N, Perthame B, Buguin A, Silberzan P. 2011. Directional persistence of chemotactic bacteria in a traveling concentration wave. PNAS. 
Available online &#8617;      "
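To make the two-goal gradient concrete, here is a small Python sketch of the ligand concentration formula above (the function name is our own):

import math

GOALS = [(1500, 1500), (-1500, 1500)]
D = [math.dist((0, 0), g) for g in GOALS]   # distances from the origin to each goal

def ligand_concentration(x, y):
    # Sum one linear gradient term per goal, as in the formula above.
    return sum(100 * 10**6 * (1 - math.dist((x, y), g) / Dg)
               for g, Dg in zip(GOALS, D))

print(ligand_concentration(0, 0))  # both terms vanish at the origin: prints 0.0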
     
   } ,
  
   {
     
        "title"    : "Coronavirus Exercises Part 1",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/exercises_part_1",
        "date"     : "",
        "content"  : "Determining a shape’s center of mass mathematicallyIn the main text, we noted that the center of some shapes can be computed mathematically. Consider the semicircular arc shown in the figure below, with endpoints (-1, 0) and (1, 0).A semicircular arc with radius 1 corresponding to a circle whose center is at the origin.Exercise: Determine the center of mass of the shape in the figure above. (Hint: finding the x-coordinate of the center of mass is easy, but finding the y-coordinate requires a little calculus.)Exercise: Say that we connect (-1, 0) and (0, 1) to form a closed semicircle. What will be the center of mass of the resulting shape?Calculating RMSDConsider the two shapes shown in the figure below with vectorizations of eight points each.Two hypothetical protein structures with vectorizations into eight points each.Exercise: Using the vectorization of the figures indicated, estimate the center of mass of these two protein structures.Exercise: Without rotating these structures, align the two structures so that they have the same center of mass, and determine the RMSD between the two structures for the vectorizations shown.Practicing Ab initio and homology modelingLet’s predict the structure of the beta subunit of human hemoglobin subunit alpha using both ab initio and homology structure prediction. To do so, first visit the Protein Data Bank and search for the protein “1SI4”. Download the PDB file by clicking on “Download Files” and then “PDB Format”. We will use this file for structure comparisons later. Next, go to the “Sequence” tab and click “Display Files” and then “FASTA Sequence”. In this file, find the sequence corresponding to the beta subunit.Exercise: Submit the beta subunit sequence to the ab initio structure prediction software, QUARK as well as your choice of homology modeling software: SWISS-MODEL, Robetta, and/or GalaxyWEB.In a tutorial, we saw how to use ProDy to compute RMSD between protein structures.Exercise: Use ProDy to calculate the RMSD between your predicted structures from the previous exercise and the actual structure. (Use “Chain B” of the validated structure to focus on subunit beta.) Which type of modeling (ab initio or homology modeling) resulted in the most accurate structure prediction? Is this what you expected?Trying out AlphaFoldA simplified version of AlphaFold is available for public use on Colab. This version currently does not work when using the entire spike protein, and so we will use the human hemoglobin subunit beta once again. Following the directions from a preceding exercise, obtain the protein sequence of the beta subunit. Next, open Colab’s simplified version of AlphaFold. Read the documentation and follow the directions in each step to generate a predicted structure.Exercise: Use ProDy to calculate the RMSD between the predicted structure and the actual structure. Did this simplified version of AlphaFold perform better than your ab initio and homology modeling results from the previous exercise?"
     
   } ,
  
   {
     
        "title"    : "Coronavirus Exercises Part 2",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/exercises_part_2",
        "date"     : "",
        "content"  : "Comparing structures with QresRecall the following two toy protein structures from the part 1 exercises.Two hypothetical protein structures with vectorizations into eight points each.The formula for Qres of the i-th sampled point in a protein with N amino acids is reproduced below:\[Q_{res}^{(i)} = \dfrac{1}{N-k} \sum^{residues}_{j\neq i-1,i,i+1} \textrm{exp}[-\dfrac{[d(s_i,s_j)-d(t_i,t_j)]^2}{2\sigma^2_{i,j}}]\, .\]Exercise:  Assuming that the first sampled point in the two figures are, respectively, (1,2) and (1,3), compute the Qres of the fourth sampled point.In a tutorial, we computed Qres between SARS and SARS-CoV-2 RBD to find local differences between the two structures. We will repeat this analysis with hemoglobin subunit alpha. Using the PDB files of human hemoglobin (1SI4) and mako shark hemoglobin (3MKB), upload the protein structures into VMD. Align the structures using “Multiseq” and visualize the structures using Qres as the coloring parameter.Exercise:  Use the visualizations to determine where the protein structures vary the most. Do your findings make sense intuitively?Calculating interaction energyWe will perform a short analysis of the SARS-CoV-2 spike protein’s interaction energy with the human receptor ACE2 using NAMD. Download the PDB file of the SARS-CoV-2 RBD-ACE2 complex (PDB entry: 6VW1) and download 6VW1.pdb. Upload the structure into VMD and navigate to Extension &gt; Analysis &gt; NAMD Energy. Calculate the energy types “VDW” and “Elec” using the following selections:                   Selection 1      Selection 2                  A      protein and chain B      protein and chain F and resid 482 to 486              B      protein and chain B      protein and chain F and resid 383 to 394      Exercise: How do the results of part A compare with those of part B? Can you justify the results? (Hint: Visualize the specific residues).Note: For help, visit the module tutorials for VMD visualization and NAMD Energy. For additional troubleshooting, download NamdEnergyAssignment.zip and follow README.txt.Visualizing glycansIn the conclusion of part 2 of this module, we mentioned that the surface of viruses and host cells are “fuzzy” because they are covered by structures called glycans. We will use what we learned about VMD in a tutorial to visualize glycans on the spike protein of both viruses. First, we will need to download the protein structures of the SARS-CoV spike protein (PDB entry: 5xlr and the SARS-CoV-2 spike protein (PDB entry: 6vxx. Load each structure into VMD and navigate to Graphics &gt; Representations. For VMD, there is no specific keyword to select glycans, and so we will use a workaround with the keywords: "not protein and not water".Exercise: Create the representation. Assuming that all the non-proteins are glycans, which structure contains the most number of glycans? Do you think that this supports SARS-CoV-2’s higher infectivity compared to SARS-CoV?Contact mapsExercise: In our GNM tutorial, we created the contact map using a threshold distance of 20 angstroms. Try making the contact map of one of the chains of SARS-CoV-2 spike protein 6vxx using thresholds of 10 angstroms and 40 angstroms. How different are the resulting contact maps?"
     
   } ,
  
   {
     
        "title"    : "The Feedforward Loop Motif",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/feedforward",
        "date"     : "",
        "content"  : "Feedforward loopsIn the previous lesson, we saw that negative autoregulation can be used to lower the response time of a protein to an external stimulus. But the cell can only use autoregulation to respond quickly if the autoregulated protein is itself a transcription factor. Only about 300 of the 4,400 total E. coli proteins are transcription factors1. How can the cell speed up the manufacture of a protein that is not a transcription factor?The answer lies in another transcription factor network motif called the feedforward loop (FFL). An FFL, shown in the figure below, is a network substructure in which X is connected to both Y and Z, and Y is connected to Z. Calling the FFL motif a “loop” is a misnomer. Rather, it is a small network structure with two paths from X to Z; one via direct regulation of Z by X, and another with an intermediate transcription factor Y. Note that X and Y must be transcription factors because they have edges leading out from them, but Z does not have to be a transcription factor (and typically is not).The FFL motif. X regulates both Y and Z, and Y regulates Z.There are 42 FFLs in the transcription factor network of E. coli2, of which five have the structure below, in which X activates Y, X activates Z, and Y represses Z. This specific form of the FFL motif is  called a type-1 incoherent feedforward loop and will be our focus for the rest of the module.STOP: How could we simulate a type-1 incoherent feedforward loop with a particle-based reaction-diffusion model akin to the simulation that we used for negative autoregulation? What would we compare this simulation against?The incoherent feed-forward loop network motif. X activates Y and Z (the associated edges are labeled with “+”), while Y represses Z (the associated edge is labeled with “-“).Modeling a type-1 incoherent feedforward loopAs we did for negative autoregulation, we will simulate two cells and examine the concentration of a particle of interest, which in this case will be Z. The first cell will have a simple activation of Z by X; we will assume that X starts at its steady state concentration and that Z is produced by the reaction X  X + Z and removed by a degradation reaction.The second cell will include both of these reactions in addition to the reaction X  X + Y to model the activation of Y by X, along with the new reaction Y + Z  Y to model the repression of Z by Y. Because Y and Z are being produced from a reaction, we will also have degradation reactions for Y and Z. For the sake of fairness, we will use the same kill rates for both Y and Z.Furthermore, to obtain a mathematically controlled comparison, the reaction X  X + Z should have a higher rate in the second simulation modeling the FFL. If we do not raise the rate of this reaction, then the repression of Z by Y will cause the steady state concentration of Z to be lower in the second simulation.If you are feeling adventurous, then you may like to adapt the negative autoregulation tutorial to run the above two simulations and tweak the rate of the X  X + Z reaction to see if you can obtain the same steady state concentration of Z in the two simulations. We also provide the following tutorial guiding you through setting up these simulations, which we will interpret in the next section.Visit tutorialWhy feedforward loops speed up response timesThe figure below shows a plot visualizing the concentration of Z across the two simulations. 
As with negative autoregulation, the type-1 incoherent FFL allows the cell to ramp up production of a protein Z to steady state much faster than it could under simple regulation.The concentration of Z in the two simulations. Simple activation of Z by X is shown in blue, and the type-1 incoherent FFL is shown in purple.{: style=”font-size: medium;”}sNote the different pattern to the growth of Z than we saw under negative autoregulation. When modeling negative autoregulation, the concentration of the protein approached steady state from below. In the case of the FFL, the concentration of Z grows so quickly that it passes its eventual steady state concentration and then returns to this steady state from above.We can interpret from the model why the FFL allows for a fast response time as well as why it initially passes the steady state concentration. At the start of the simulation, Z is activated by X very quickly. X activates Y as well, but at a lower rate than the regulation of Z because Y only has its own degradation to slow this process. Therefore, more Z is initially produced than Y, which causes the concentration of Z to shoot past its eventual steady state.The figure above is reminiscent of a damped oscillation process like the one in the figure below, in which the concentration of a particle oscillates above and below a steady state, while the amplitude of the wave gets smaller and smaller. We wonder: is it possible for a network motif to produce more of a true “wave” without dampening?In a damped oscillation, the value of some variable (shown on the y-axis) oscillates back and forth around an asymptotic value while the amplitude decreases over time (shown on the x-axis).3Next lesson            Gene ontology database with “transcription” keyword: https://www.uniprot.org/. &#8617;              Mangan, S., &amp; Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences of the United States of America, 100(21), 11980–11985. https://doi.org/10.1073/pnas.2133841100 &#8617;              https://www.toppr.com/guides/physics/oscillations/damped-simple-harmonic-motion/ &#8617;      "
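To see the overshoot numerically without a particle-based simulation, here is a deterministic ODE sketch of the two cells; all rate constants are illustrative and are chosen so that both cells share the steady state Z = 1.

import numpy as np
from scipy.integrate import odeint

x = 1.0                                  # X is held at its steady state
beta, beta_ffl, gamma, rep = 1.0, 3.0, 1.0, 2.0

def simple(state, t):
    z = state[0]
    return [beta * x - gamma * z]                  # X -> X + Z and degradation

def ffl(state, t):
    y, z = state
    dy = beta * x - gamma * y                      # X -> X + Y and degradation
    dz = beta_ffl * x - gamma * z - rep * y * z    # faster production, Y + Z -> Y
    return [dy, dz]

t = np.linspace(0, 10, 200)
z_simple = odeint(simple, [0.0], t)        # approaches Z = 1 from below
z_ffl = odeint(ffl, [0.0, 0.0], t)[:, 1]   # overshoots Z = 1, then settles back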
     
   } ,
  
   {
     
        "title"    : "Stochastic Simulation of Chemical Reactions",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/gillespie",
        "date"     : "",
        "content"  : "Verifying a theoretical steady state concentration via stochastic simulationIn the previous module, we saw that we could avoid tracking the positions of individual particles if we assume that the particles are well-mixed, i.e., uniformly distributed throughout their environment. We will apply this assumption in our current work as well, in part because the E. coli cell is so small. As a proof of concept, we will see if a well-mixed simulation replicates a reversible reaction’s equilibrium concentrations of particles that we found in the previous lesson.Even though we can calculate steady state concentrations manually, a particle-free simulation will be useful for two reasons. First, this simulation will give us snapshots of the concentrations of particles in the system over multiple time points and allow us to see how quickly the concentrations reach equilibrium. Second, we will soon expand our model of chemotaxis to have many particles and reactions that depend on each other, and direct mathematical analysis of the system will become impossible.Note: The difficulty posed to precise analysis of systems with multiple chemical reactions is comparable to the famed “n-body problem” in physics. Predicting the motions of two celestial objects interacting due to gravity can be done exactly, but once we add more bodies to the system, no exact solution exists, and we must rely on simulation.Our particle-free model will apply an approach called Gillespie’s stochastic simulation algorithm, which is often called the Gillespie algorithm or just SSA for short. Before we explain how this algorithm works, we will take a short detour to provide some needed probabilistic context.The Poisson and exponential distributionsImagine that you own a store and have noticed that on average, λ customers enter your store in a single hour. Let X denote the number of customers entering the store in the next hour; X is an example of a random variable because its value may change depending on random chance. If we assume that customers are independent actors, then X follows a Poisson distribution. It can be shown that for a Poisson distribution, the probability that exactly n customers arrive in the next hour is\[\mathrm{Pr}(X = n) = \dfrac{\lambda^n e^{-\lambda}}{n!}\,,\]where e is the mathematical constant known as Euler’s number and is equal to 2.7182818284…Note: A derivation of the above formula for Pr(X = n) is beyond the scope of our work here, but if you are interested in one, please check out this article by Andrew Chamberlain.Furthermore, the probability of observing exactly n customers in t hours, where t is an arbitrary positive number, is\[\dfrac{(\lambda t)^n e^{-\lambda t}}{n!}\,.\]We can also ask how long we will typically have to wait for the next customer to arrive. Specifically, what are the chances that this customer will arrive after t hours? If we let T be the random variable corresponding to the wait time on the next customer, then the probability of T being at least t is the probability of seeing zero customers in t hours:\[\mathrm{Pr}(T &gt; t) = \mathrm{Pr}(X = 0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t}\,.\]In other words, the probability Pr(T &gt; t) that the wait time is longer than time t decays exponentially as t increases. For this reason, the random variable T is said to follow an exponential distribution. 
It can be shown that the expected value of the exponential distribution (i.e., the average amount of time that we will need to wait for the next event to occur) is 1/λ. STOP: What is the probability Pr(T &lt; t)? The Gillespie algorithm We now return to explain the Gillespie algorithm for simulating multiple chemical reactions in a well-mixed environment. The engine of this algorithm runs on a single question: given a well-mixed environment of particles and a reaction involving those particles taking place at some average rate, how long should we expect to wait before this reaction occurs somewhere in the environment? This is the same question that we asked in the previous discussion; we have simply replaced customers entering a store with instances of a chemical reaction. The average number λ of occurrences of the reaction in a unit time period is the rate r at which the reaction occurs. Therefore, an exponential distribution with average wait time 1/r can be used to model the time between instances of the reaction. Next, say that we have two reactions proceeding independently of each other and occurring at rates r1 and r2. The two reactions taken together occur at the combined rate r1 + r2, and the number of occurrences of either reaction in a unit time period also follows a Poisson distribution. Therefore, the time that we must wait for either of the two reactions to occur is exponentially distributed, with an average wait time equal to 1/(r1 + r2). Numerical methods allow us to generate a random number simulating the wait time of an exponential distribution. By repeatedly generating these numbers, we can obtain a series of wait times between consecutive reaction occurrences. Once we have generated a wait time, we should determine the reaction to which it corresponds. If the rates of the two reactions are equal, then we simply choose one of the two reactions randomly with equal probability. But if the rates of these reactions are different, then we should choose one of the reactions via a probability that is weighted in direct proportion to the rate of the reaction; that is, the larger the rate of the reaction, the more likely that this reaction corresponds to the current event.1 To do so, we select the first reaction with probability r1/(r1 + r2) and the second reaction with probability r2/(r1 + r2). (Note that these two probabilities sum to 1.) As illustrated in the figure below, we will demonstrate the Gillespie algorithm by returning to our ongoing example, in which we are modeling the forward and reverse reactions of ligand-receptor binding and dissociation. These reactions have rates rbind = kbind · [L] · [T] and rdissociate = kdissociate · [LT], respectively. First, we choose a wait time according to an exponential distribution with mean value 1/(rbind + rdissociate). Then, the probability that the event corresponds to a binding reaction is given by Pr(L + T → LT) = rbind/(rbind + rdissociate), and the probability that it corresponds to a dissociation reaction is Pr(LT → L + T) = rdissociate/(rbind + rdissociate). A visualization of a single reaction event used by the Gillespie algorithm for ligand-receptor binding and dissociation. Red circles represent ligands (L), and orange wedges represent receptors (T). The wait time for the next reaction is drawn from an exponential distribution with mean 1/(rbind + rdissociate). The probability of this event corresponding to a binding or dissociation reaction is proportional to the rate of the respective reaction. To generalize the Gillespie algorithm to n reactions occurring at rates r1, r2, …, rn, the wait time between reactions will be exponentially distributed with average 1/(r1 + r2 + … + rn). Once we select the next reaction to occur, the likelihood that it is the i-th reaction is equal to ri/(r1 + r2 + … + rn). Throughout this module, we will employ BioNetGen to apply the Gillespie algorithm to well-mixed models of chemical reactions. We will use our ongoing example of ligand-receptor binding and dissociation to introduce the way in which BioNetGen represents molecules and reactions involving them. The following tutorial shows how to implement this reaction in BioNetGen and use the Gillespie algorithm to determine the equilibrium of a reversible ligand-receptor binding reaction. Visit tutorial Does the Gillespie algorithm confirm our steady state calculations? In the previous lesson, we showed an example in which a system with 10,000 free ligand molecules and 7,000 free receptor molecules produced the following steady state concentrations using the experimentally verified binding rate of kbind = 0.0146 (molecules/µm3)-1s-1 and dissociation rate of kdissociate = 35s-1: [LT] = 4,793 molecules/µm3; [L] = 5,207 molecules/µm3; [T] = 2,207 molecules/µm3. Our model uses the same number of initial molecules and the same reaction rates. The system evolves via the Gillespie algorithm, and we track the concentration of free ligand molecules, ligand molecules bound to receptor molecules, and free receptor molecules over time. The figure below demonstrates that the Gillespie algorithm converges quickly to the same values calculated just above. Furthermore, the system reaches steady state in a fraction of a second. A concentration plot over time for ligand-receptor dynamics via a BioNetGen simulation employing the Gillespie algorithm. Time is shown (in seconds) on the x-axis, and concentration is shown (in molecules/µm3) on the y-axis. The molecules quickly reach steady state concentrations that match those identified by hand. This simple ligand-receptor model is just the beginning of our study of chemotaxis. In the next section, we will delve into the complex biochemical details of chemotaxis. Furthermore, we will see that the Gillespie algorithm for stochastic simulations will scale easily as our model of this system grows more complex. Next lesson            Schwartz R. Biological Modeling and Simulation: A Survey of Practical Models, Algorithms, and Numerical Methods. Chapter 17.2. &#8617;      "
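Here is a bare-bones Python sketch of a single Gillespie run for this reversible reaction (for intuition only, not the BioNetGen tutorial); the rates match the values quoted above, and molecule counts stand in for concentrations:

import numpy as np

k_bind, k_dissociate = 0.0146, 35.0
L, T, LT = 10000, 7000, 0
t, t_end = 0.0, 0.1
rng = np.random.default_rng()

while t < t_end:
    r_bind = k_bind * L * T               # rate of L + T -> LT
    r_dissociate = k_dissociate * LT      # rate of LT -> L + T
    total = r_bind + r_dissociate
    t += rng.exponential(1 / total)       # wait time ~ Exponential(mean 1/total)
    if rng.random() < r_bind / total:     # pick a reaction weighted by its rate
        L, T, LT = L - 1, T - 1, LT + 1
    else:
        L, T, LT = L + 1, T + 1, LT - 1
print(L, T, LT)  # counts hover near the steady state computed in the lesson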
     
   } ,
  
   {
     
        "title"    : "Glycans",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/glycans",
        "date"     : "",
        "content"  : "The surface of viruses and host cells are not smooth, but rather “fuzzy”. This is because the surface is decorated by structures called glycans, which consists of numerous monosaccharides linked together by glycosidic bonds. Although this definition is also shared with polysaccharides, glycans typically refer to the carbohydrate portion of glycoproteins, glycolipids, or proteoglycans 1. Glycans have been found to have structural and modulatory properties and are crucial in recognition events, most commonly by glycan-binding proteins (GBPs) 2. In viral pathogenesis, glycans on host cells act as primary receptors, co-receptors, or attachment factors that are recognized by viral glycoproteins for viral attachment and entry. On the other hand, glycans on viral surfaces are key for viral recognition by the host immune system 3. Unfortunately, some viruses have evolved methods that allow them to effectively conceal themselves from the immune system. One such method, which is used by SARS-CoV-2, is a “glycan shield”, where glycosylation of surface antigens allows the virus to hide from antibody detection. Another notorious virus that uses glycan shielding is HIV. The Spike protein of SARS-CoV-2 was also found to be heavily glycosylated, shielding around 40% of the Spike protein from antibody recognition 4. Such glycosylation does not hinder the Spike protein’s ability to interact with human ACE2 because the Spike protein is able to adopt an open conformation, allowing a large portion of the RBD being exposed.Glycans are generally very flexible and have large internal motions that makes it difficult to get an accurate description of their 3D shapes. Fortunately, molecular dynamics (MD) simulations can be employed to predict the motions and shapes of the glycans. With a combination of MD and visualization tools (i.e. VMD), snapshots of the Spike protein with its glycosylation can be created.Nonetheless, basic visualizations of the Spike protein with its glycans can be made using just VMD. Here, we used SARS-CoV-2 Spike in its closed conformation (6vxx)) and SARS-CoV-2 Spike in its open conformation (6vyb) to create the following images. Notice how the RBD in the orange chain is much more exposed in the open conformation. The presumed glycans are shown in red.To see how to visualize glycans in VMD, go to the following tutorial.Visit tutorialNext lessonGlycansThe surface of viruses and host cells are not smooth, but rather “fuzzy”. This is because the surface is decorated by structures called glycans, which consists of numerous monosaccharides linked together by glycosidic bonds. Although this definition is also shared with polysaccharides, glycans typically refer to the carbohydrate portion of glycoproteins, glycolipids, or proteoglycans [^Dwek]. Glycans have been found to have structural and modulatory properties and are crucial in recognition events, most commonly by glycan-binding proteins (GBPs) [^Varki]. In viral pathogenesis, glycans on host cells act as primary receptors, co-receptors, or attachment factors that are recognized by viral glycoproteins for viral attachment and entry. On the other hand, the immune system can recognize foreign glycans on viral surfaces and target the virus [^Raman]. Unfortunately, some viruses have evolved methods that allow them to effectively conceal themselves from the immune system. One such method is a glycan shield. By covering the viral surface and proteins with glycans, the virus can physically shield itself from antibody detection. 
Because the virus replicates by hijacking the host cells, the glycan shield can consist of host glycans and mimic the surface of a host cell. A notorious virus that uses glycan shielding is HIV. In the case of SARS-CoV-2, the immune system recognizes the virus through specific areas, or antigens, along the S protein. The S protein, however, is a glycoprotein, meaning that it is covered with glycans which can shield the S protein antigens from being recognized.In our last tutorial, we will use VMD to try to visualize the glycans of SARS-CoV-2 S protein.Visit tutorialFrom the visualization we created in the tutorial, we can see that glycans are present all around the S protein. In fact, the glycans cover around 40% of the Spike protein[^Grant]! This raises an important question: If the glycans on the S protein can hide from antibodies, won’t it get in the way of binding with ACE2? Such glycosylation does not hinder the Spike protein’s ability to interact with human ACE2 because the Spike protein is able to adopt an open conformation, allowing a large portion of the RBD being exposed. In the figure below, we compared the SARS-CoV-2 Spike in its closed conformation (PDB entry: 6vxx)) and SARS-CoV-2 Spike in its open conformation (PDB entry: 6vyb). The presumed glycans are shown in red. Notice how the RBD in the orange chain is much more exposed in the open conformation.This figure shows the SARS-CoV-2 S protein in the closed conformation (left) and the protein with an open conformation of one chain (right) using the PDB entries 6vxx and 6vyb, respectively. The protein chains are shown in dark orange, yellow, and green. The presumed glycans are shown in red. Notice how in the open conformation, the RBD of one of the chain is pointed upwards, exposing it for ACE2 interactions.Glycans are generally very flexible and have large internal motions that makes it difficult to get an accurate description of their 3D shapes. Fortunately, molecular dynamics (MD) simulations can be employed to predict the motions and shapes of the glycans. With a combination of MD and visualization tools (i.e. VMD), very nice looking snapshots of the glycans on the S protein can be created.Snapshots from molecular dynamics simulations of the SARS-CoV-2 S protein with different glycans shown in green, yellow, orange, and pink. Source: https://doi.org/10.1101/2020.04.07.030445 [^Grant]SARS-CoV-2 VaccineMuch of vaccine development for SARS-CoV-2 has been focused on the S protein given how it facillitates the viral entry into host cells. In vaccine development, it is critical to understand every strategy that the virus employs to evade immune response. As we have discussed, SARS-CoV-2 hides its S protein from antibody recognition through glycosylation, creating a glycan shield around the protein. In fact, the “stalk” of the S protein has been found to be completely shielded from antibodies and other large molecules. In contrast, the “head” of the S protein is vulnerable because of the RBD is less glycosylated and becomes fully exposed in the open conformation. Thus, there is an opportunity to design small molecules that target the head of the protein [^Casalino]. Glycan profiling of SARS-CoV-2 is extremely important in guiding vaccine development as well as improving COVID-19 antigen testing [^Watanabe].Sources            Dwek, R.A. Glycobiology: Toward Understanding the Function of Sugars. Chem. Rev. 96(2),  683-720 (1996). https://pubs.acs.org/doi/10.1021/cr940283b &#8617;              Varki A, Lowe JB. 
Biological Roles of Glycans. In: Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology, 2nd edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2009. Chapter 6. https://www.ncbi.nlm.nih.gov/books/NBK1897/ Raman, R., Tharakaraman, K., Sasisekharan, V., & Sasisekharan, R. Glycan-protein interactions in viral pathogenesis. Current Opinion in Structural Biology 40, 153–162 (2016). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5526076/ Grant, O. C., Montgomery, D., Ito, K., & Woods, R. J. Analysis of the SARS-CoV-2 spike protein glycan shield: implications for immune recognition. bioRxiv 2020.04.07.030445. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7217288/ "
     
   } ,
  
   {
     
        "title"    : "From Static Protein Analysis to Molecular Dynamics",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/gnm",
        "date"     : "",
        "content"  : "Modeling proteins using tiny springsYou may think that simulating the movements of proteins with hundreds of amino acids will prove hopeless. After all, predicting the static structure of a protein has occupied biologists for decades! Yet part of what makes structure prediction so challenging is that the search space of potential structures is enormous. In contrast, once we have established the static structure of a protein, it will not deviate greatly from this static structure, and so the space of potential dynamic structures is narrowed down to those that are similar to the static structure.A protein’s molecular bonds are constantly vibrating, stretching and compressing, much like that of the oscillating mass-spring system shown in the figure below. Bonded atoms are held at a specific distance apart due to the attraction and repulsion of the negatively charged electrons and positively charged nucleus. If you were to push the atoms closer together or pull them farther apart, then they would “bounce back” to their equilibrium.A mass-spring system in which a mass is attached to the end of a spring. The more that we move the mass from its equilibrium, the more that it will be repelled back toward equilibrium. Image courtesy: flippingphysics.com.In an elastic network model (ENM), we imagine nearby alpha carbons of a protein structure to be connected by springs. Because distant atoms will not influence each other, we will only connect two alpha carbons if they are within some threshold distance of each other. In this lesson, we will describe a Gaussian network model (GNM), an ENM for molecular dynamics.Representing random movements of alpha carbonsWe will introduce GNMs using our old friend human hemoglobin (1A3N.pdb). We first convert hemoglobin into a network of nodes and springs, in which each alpha carbon is given by a node, and two alpha carbons are connected by a string if they are within a threshold distance; the figure below uses a threshold value of 7.3 angstroms.Conversion of human hemoglobin (left) into a network of nodes and springs (right) in which two nodes are connected by a spring if they are within a threshold distance of 7.3 angstroms.The alpha carbons in a protein are subject to random fluctuations that cause them to move from their equilibrium positions. These fluctuations are Gaussian, meaning that the alpha carbon deviates randomly from its equilibrium position according to a normal (bell-shaped) distribution. In other words, although an alpha carbon’s position is due to random chance, it is more likely to be near the equilibrium than far away.Although atomic fluctuations are powered by randomness, the movements of protein atoms are heavily correlated. For example, imagine the simple case in which all of a protein’s alpha carbons are connected in a straight line. If we pull the first alpha carbon away from the center of the protein, then the second alpha carbon will be pulled along with it. Our goal is to understand how the movements of every pair of alpha carbons may be related.Inner products and cross-correlationsAs illustrated in the figure below, the variable vector \(\mathbf{R_{ij}}\) represents the distance between nodes i and j. The equilibrium position of node \textvar{i} is represented by the vector \(\mathbf{R_i^0}\), and its displacement from equilibrium is denoted by the (variable) vector \(\mathbf{\mathbf{\Delta R_i}}\). 
The distance between node i and node j at equilibrium is denoted by the vector \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0} - \mathbf{R_i^0}\). (Left) A small network of nodes connected by springs deriving from a protein structure. The distance between two nodes i and j is denoted by the variable \(\mathbf{R_{ij}}\). (Right) Zooming in on two nodes i and j that are within the threshold distance and therefore connected by a spring. The equilibrium positions of node i and node j are represented by the vectors \(\mathbf{R_i^0}\) and \(\mathbf{R_j^0}\), with the distance between them denoted \(\mathbf{R_{ij}^0}\), which is equal to \(\mathbf{R_j^0} - \mathbf{R_i^0}\). The vectors \(\mathbf{\Delta R_i}\) and \(\mathbf{\Delta R_j}\) represent the nodes’ respective changes from equilibrium. Image courtesy: Ahmet Bakan. To determine how the movements of alpha carbons i and j are related, we need to study the fluctuation vectors \(\mathbf{\Delta R_i}\) and \(\mathbf{\Delta R_j}\). Do these vectors point in similar or opposing directions? To answer this question, we compute the inner product, or dot product, of the two vectors, \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle\). The inner product of two vectors \(\mathbf{x} = (x_1, x_2, x_3)\) and \(\mathbf{y} = (y_1, y_2, y_3)\) is given by \(\langle \mathbf{x}, \mathbf{y} \rangle = x_1 y_1 + x_2 y_2 + x_3 y_3\). If \(\mathbf{x}\) and \(\mathbf{y}\) are perpendicular, then \(\langle \mathbf{x}, \mathbf{y} \rangle\) is equal to zero. The more that the two vectors point in the same direction, the larger the value of \(\langle \mathbf{x}, \mathbf{y} \rangle\); and the more that they point in opposite directions, the more negative the value of \(\langle \mathbf{x}, \mathbf{y} \rangle\). STOP: Say that x = (1, -2, 3), y = (2, -3, 5), and z = (-1, 3, -4). Compute the inner products ⟨x, y⟩ and ⟨x, z⟩. Ensure that your answers match the preceding observation about the value of the inner product and the directions of vectors. The inner product is also useful for representing the mean-square fluctuation of an alpha carbon, or the expected squared distance of node i from equilibrium. If the fluctuation vector \(\mathbf{\Delta R_i}\) is represented by the coordinates (x, y, z), then its squared distance from equilibrium is \(x^2 + y^2 + z^2\), which is the inner product of the fluctuation vector with itself, \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\). Alpha carbons having large values of this mean-square fluctuation may belong to more flexible regions of the protein. Note: In practice, we will not know the exact values of the \(\mathbf{\Delta R_i}\) and must compute the inner products of fluctuation vectors indirectly using linear algebra, which is beyond the scope of this work. A full treatment of the mathematics of GNMs can be found in the chapter at https://www.csb.pitt.edu/Faculty/bahar/publications/b14.pdf. Long vectors pointing in the same direction will have a larger inner product than short vectors pointing in the same direction. To have a metric for the correlation of two vectors’ movements that is independent of their length, we should therefore normalize the inner product. The cross-correlation of alpha carbons i and j is given by \[C_{ij} = \dfrac{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle}{\sqrt{\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle \cdot \langle \mathbf{\Delta R_j}, \mathbf{\Delta R_j} \rangle}}.\] After normalization, the cross-correlation ranges from -1 to 1. 
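As an aside, these quantities are small enough to compute directly with a few lines of NumPy. The sketch below is our own illustration, independent of the software used in the tutorial; it assumes that coords is an n × 3 array of equilibrium alpha carbon coordinates (in angstroms) and returns the n × n matrix of cross-correlations defined above.

import numpy as np

def gnm_cross_correlations(coords, cutoff=7.3):
    """Sketch: GNM cross-correlations from alpha carbon coordinates."""
    n = len(coords)
    # Kirchhoff (connectivity) matrix: -1 for each pair of nodes within the
    # cutoff distance, and the number of contacts on the diagonal.
    distances = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    contacts = (distances < cutoff) & ~np.eye(n, dtype=bool)
    kirchhoff = np.diag(contacts.sum(axis=1)) - contacts.astype(float)
    # The inner products <dR_i, dR_j> are proportional to the entries of the
    # pseudoinverse of the Kirchhoff matrix (the matrix is singular, so a
    # true inverse does not exist).
    gamma = np.linalg.pinv(kirchhoff)
    # Normalize by the mean-square fluctuations to land in [-1, 1].
    d = np.sqrt(np.diag(gamma))
    return gamma / np.outer(d, d)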
A cross-correlation of -1 means that the two alpha carbons’ movements are completely anti-correlated, and a cross-correlation of 1 means that their movements are completely correlated. After computing the cross-correlation of every pair of alpha carbons in a protein structure with n residues, we obtain an n × n cross-correlation matrix C such that C(i, j) is the cross-correlation between amino acids i and j. We can visualize this matrix using a heat map, in which we color matrix values along a spectrum from blue (-1) to red (1). The heat map for the cross-correlation matrix of human hemoglobin is shown in the figure below. The normalized cross-correlation heat map of human hemoglobin (PDB: 1A3N). Red regions indicate correlated residue pairs that move in the same direction; blue regions indicate anti-correlated residue pairs that move in opposite directions. The cross-correlation map of a protein contains complex patterns of correlated and anti-correlated movement of the protein’s atoms. For example, we should not be surprised that the main diagonal of the above cross-correlation heat map is colored red, since alpha carbons that are near each other in a polypeptide chain will likely have correlated movements. Furthermore, the regions of high correlation near the main diagonal typically provide information regarding the protein’s secondary structures, since amino acids belonging to the same secondary structure will typically move in concert. High-correlation regions that are far away from the matrix’s main diagonal provide information about the tertiary structure of the protein, such as domains that are distant in terms of sequence but work together in the protein structure. In the cross-correlation heat map in the figure above, we can see four squares of positive correlation along the main diagonal, representing the four subunits of hemoglobin from bottom left to top right: α1, β1, α2, and β2. Note that the first and third squares exhibit the same patterns; these squares correspond to α1 and α2, respectively. Similarly, the second and fourth squares correspond to β1 and β2, respectively. STOP: What other patterns do you notice in the hemoglobin cross-correlation heat map? Mean-square fluctuations and B-factors: Just as we can visualize cross-correlation, we would also like to visualize the mean-square fluctuations that the GNM predicts. During X-ray crystallography, the displacement of atoms within the protein crystal decreases the intensity of the scattered X-rays, creating uncertainty in the positions of atoms. A B-factor, also known as a temperature factor or Debye-Waller factor, is a measure of this uncertainty. It is beyond the scope of this work, but the theoretical B-factor of the i-th alpha carbon is equal to a constant factor times the estimated mean-square fluctuation, \[B_i = \frac{8 \pi^2}{3} \langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle.\] Once we have computed theoretical B-factors, we can compare them against experimental results. The figure below shows a 3-D structure of the hemoglobin protein, with amino acids colored along a spectrum according to both theoretical and experimental B-factors for the sake of comparison. In these structures, red indicates large B-factors (mobile amino acids), and blue indicates small B-factors (static amino acids). 
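Continuing the earlier sketch (and again only as an illustration), the theoretical values come straight from the diagonal of the Kirchhoff pseudoinverse, gamma = np.linalg.pinv(kirchhoff); the scale argument stands in for the physical constants that we are glossing over.

import numpy as np

def theoretical_b_factors(gamma, scale=1.0):
    """Sketch: theoretical B-factors from the Kirchhoff pseudoinverse.

    The diagonal of gamma is proportional to the mean-square fluctuations
    <dR_i, dR_i>, so each B-factor is a constant times a diagonal entry.
    """
    return (8 * np.pi ** 2 / 3) * scale * np.diag(gamma)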
Note that the mobile amino acids are generally found at the ends of secondary structures on the outer edges of the protein, which is expected because the boundaries between secondary structures typically contain highly fluctuating residues. This figure confirms the general observation that theoretical B-factors tend to correlate well with experimental B-factors in practice1 and offers one more example of a property of a biological system that we can infer computationally without the need for costly experimentation. (Top) Human hemoglobin colored according to theoretical B-factors calculated from GNM (left) and experimental B-factors (right). Blue indicates low B-factors, and red indicates high B-factors. Subunit α1 is located at the top left quarter of the protein. (Bottom) A 2-D plot comparing the theoretical (blue) and experimental (black) B-factors of subunit α1. The theoretical and experimental B-factors are correlated with a coefficient of 0.63. Normal mode analysis: When listening to your favorite song, you probably do not think of the individual notes that it comprises. Yet a talented musician can dissect the song into the set of notes that each instrument contributes to the whole. Just because the music combines a number of individual sound waves does not mean that we cannot deconvolve the music into its constituent waves. All objects, from colossal skyscrapers to tiny proteins, vibrate. And, just like in our musical example, these oscillations are the net result of individual waves passing through the object. The paradigm of breaking down a collection of vibrations into the comparatively small number of “modes” that summarize them is called normal mode analysis (NMA). The mathematical details are complicated, but by deconvolving a protein’s movement into individual normal modes, we can observe how each mode affects individual amino acids. As we did with B-factors, we can visualize the results of a given mode using a line graph, called the mode shape plot. The x-axis of this plot corresponds to the amino acid sequence of the protein, and the height at the i-th position on the x-axis corresponds to the magnitude of the square fluctuation contributed by the mode to the protein’s i-th amino acid. Just as a piece of music can have one instrument that is much louder than another, some of the oscillations contributing to an object’s vibrations may be more significant than others. NMA also determines the degree to which each mode contributes to the overall fluctuations of a protein; the mode contributing the most is called the slowest mode of the protein. The figure below shows a mode shape plot of the slowest mode for each of the four subunits of human hemoglobin and reveals that all four subunits have a similar mode shape for the slowest mode.
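Although the derivation is beyond our scope, the modes come from an eigendecomposition of the same Kirchhoff matrix introduced above. The sketch below is once more our own illustration (not the tutorial’s code): each eigenvector of the matrix is a mode, smaller nonzero eigenvalues correspond to slower modes, and a mode’s shape is its squared components weighted by the reciprocal of its eigenvalue.

import numpy as np

def mode_shapes(kirchhoff, n_modes=10):
    """Sketch: square fluctuations contributed by the slowest GNM modes.

    Returns an (n_modes, n) array whose k-th row is the mode shape of the
    (k+1)-th slowest mode. The eigenvector with eigenvalue zero corresponds
    to rigid motion of the whole network, so we skip it.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(kirchhoff)  # ascending order
    shapes = [eigenvectors[:, k] ** 2 / eigenvalues[k]
              for k in range(1, n_modes + 1)]
    return np.array(shapes)

Plotting the first row of this array against residue index would give a slowest-mode plot like the one below, and averaging the rows would give the ten-mode average discussed shortly.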
(Top) Visualization of human hemoglobin colored based on GNM mode shape for the slowest mode (left) and the average of the ten slowest modes (right), i.e., the ten modes that contribute the most to the square fluctuation. Regions of high mobility are colored red, corresponding to peaks in the mode shape plot. (Bottom) A mode shape plot of the slowest mode for human hemoglobin, separated over each of the four chains, shows that the four chains have a similar slowest mode. Similar to cross-correlation, analyzing a protein’s mode shapes gives insights into the structure of the protein, and comparing mode shapes for two proteins can reveal differences. For example, the mode shape plots in the figure above show that the slowest mode shapes for the four subunits of hemoglobin are quite similar. We should consult more than just a single mode when completing a full analysis of a protein’s molecular dynamics. Below is the slow mode plot averaging the “slowest” ten modes of hemoglobin, meaning the ten modes that have the greatest effect on mean fluctuation. Unlike when we examined only the slowest mode, we can now see a stark difference when comparing α subunits (chains A and C) to β subunits (chains B and D). The average mode shape of the slowest ten modes for each of the four human hemoglobin subunits using GNM. Note that the plots for α subunits (chains A and C) and β subunits (chains B and D) differ more than when considering only the slowest mode. We are now ready to apply what we have learned to build a GNM for the SARS-CoV and SARS-CoV-2 spike proteins in a tutorial and analyze the dynamics of these proteins using the plots that we have introduced in this section. Visit tutorial. Molecular dynamics analyses of SARS-CoV and SARS-CoV-2 spike proteins using GNM: The figure below shows the cross-correlation heat maps of the SARS-CoV and SARS-CoV-2 spike proteins, indicating that these proteins may have similar dynamics. The cross-correlation heat maps of the SARS-CoV-2 spike protein (top-left), the SARS-CoV spike protein (top-right), a single chain of the SARS-CoV-2 spike protein (bottom-left), and a single chain of the SARS-CoV spike protein (bottom-right). The next figure shows the mode shape plot for the slowest mode of the two proteins. The protein region between positions 200 and 500 of the spike protein is the most mobile and overlaps with the RBD region, found between residues 331 and 524. (Top) A mode shape plot for the slowest mode of the SARS-CoV-2 spike protein (left) and the SARS-CoV spike protein (right). (Bottom) A mode shape plot for the slowest mode of a single chain of the SARS-CoV-2 spike protein (left) and a single chain of the SARS-CoV spike protein (right). Note that the plot on the right is inverted compared to the one on the left because of a choice made by the software, but the two plots have the same shape if we consider the absolute value. These plots show that the two viruses have similar dynamics, and that residues 200 to 500 fluctuate the most, a region that overlaps heavily with the RBD. We can also examine the mode shape plot for the average of the slowest ten modes for the two spike proteins (see figure below). Using this plot, we color flexible parts of the protein red and inflexible parts of the protein blue. Average mode shape of the slowest ten modes of the SARS-CoV-2 spike (left) and the SARS-CoV spike (right). The first peak corresponds to the N-terminal domain (NTD), and the second peak corresponds to the RBD. 
Above the mode shape plots, the viral spike proteins are colored according to the value of the mode shape, with high values colored red and indicating greater predicted flexibility; note that the SARS-CoV-2 NTD is predicted to be more flexible than that of SARS-CoV. The mode shape plots show that the RBDs of both spike proteins are highly flexible, which agrees with the biological functions of these regions. When the RBD interacts with the ACE2 enzyme on human cells, the RBD of one of the three chains “opens up”, exposing itself to more easily bind with ACE2. The figure above highlights another flexible region, the spike protein’s N-terminal domain (NTD), which appears to be more mobile in SARS-CoV-2. Like the RBD, the NTD also mediates viral infection, but by interacting with DC-SIGN and L-SIGN receptors rather than ACE2.2 These receptors are present on macrophages and dendritic cells, allowing SARS-CoV-2 to infect tissues such as the lungs, where ACE2 expression levels are low, which may help explain why SARS-CoV-2 is more likely than SARS-CoV to progress to pneumonia. The analysis provided by GNM is compelling, but it ultimately relies on the estimation of the inner products \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_i} \rangle\) and \(\langle \mathbf{\Delta R_i}, \mathbf{\Delta R_j} \rangle\). Although the \(\mathbf{\Delta R_i}\) are vectors, meaning that they have a direction as well as a magnitude, the inner products are single numbers, and so we cannot infer anything about the direction of a protein’s movements. For this reason, GNM is said to be isotropic, meaning that it only considers the magnitude of the force exerted on the springs between nearby molecules and ignores any global effect on the directions of these forces. In the next lesson, we will generalize GNMs to take these directions into account. Next lesson. Yang, L., Song, G., & Jernigan, R. L. 2009. Comparisons of experimental and computed protein anisotropic temperature factors. Proteins 76(1), 164–175. https://doi.org/10.1002/prot.22328 Soh, W. T., Liu, Y., Nakayama, E. E., Ono, C., Torii, S., Nakagami, H., Matsuura, Y., Shioda, T., Arase, H. The N-terminal domain of spike glycoprotein mediates SARS-CoV-2 infection by associating with L-SIGN and DC-SIGN. "
     
   } ,
  
   {
     
        "title"    : "Modeling a Bacterium&#39;s Response to an Attractant Gradient",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/gradient",
        "date"     : "",
        "content"  : "Traveling up an attractant gradientIn the previous lesson, we saw that E. coli is able to adapt its default tumbling frequency to the current background concentration of attractant. To model this behavior, we used the Gillespie algorithm and the rule-based language of BioNetGen to simulate an instantaneous increase in concentration from one stable concentration level to another.Yet imagine a glucose cube in an aqueous solution. As the cube dissolves, a gradient will form, with a glucose concentration that decreases outward from the cube. How will the tumbling frequency of E. coli change if the bacterium finds itself in an environment of an attractant gradient?  Will the tumbling frequency decrease continuously as well, or will we see more complicated behavior? And once the cell reaches a region of high attractant concentration, will its default tumbling frequency stabilize to the same steady state?In this lesson, we will modify our model from the previous lesson by increasing the concentration of the attractant ligand at an exponential rate and seeing how the concentration of phosphorylated CheY changes. This model will simulate a bacterium traveling up an attractant gradient toward an attractant. Moreover, we will examine how the concentration of phosphorylated CheY changes as we change the gradient’s “steepness”, or the rate at which attractant ligand is increasing. Visit the following tutorial if you’re interested in following our adjustments for yourself.Visit tutorialSteady state tumbling frequency is robustTo model a ligand concentration [L] that is increasing exponentially, we will use the function [L] = l0 · ek · t, where l0 is the initial ligand concentration, k is a constant dictating the rate of exponential growth, and t is the time. The parameter k represents the steepness of the gradient, since the higher the value of k, the faster the growth in the ligand concentration [L].For example, the following figure shows the concentration over time of phosphorylated CheY (shown in blue) when l0 = 1000 and k = 0.1. The concentration of phosphorylated CheY, and therefore the tumbling frequency, still decreases sharply as the ligand concentration increases, but after all ligands become bound to receptors (shown by the plateau in the red curve), receptor methylation causes the concentration of phosphorylated CheY to return to its equilibrium. In other words, for these values of l0 and k, the outcome is similar to when we provided an instantaneous increase in ligand, although the cell takes longer to reach its minimum concentration of phosphorylated CheY because the attractant concentration is increasing gradually.Plots of relevant molecule concentrations in our model (in number of molecules in the cell) over time (in seconds) when the concentration of ligand grows exponentially with l0 = 1000 and k = 0.1. The concentration of bound ligand (shown in red) quickly hits saturation, which causes a minimum in phosphorylated CheY (and therefore a low tumbling frequency). To respond, the cell increases the methylation of receptors, which boosts the concentration of phosphorylated CheY back to equilibrium.The following figure shows the results of multiple simulations in which we vary the growth parameter k and plot only the concentration of phosphorylated CheY over time. 
The larger the value of k, the faster the increase in receptor binding, and the steeper the drop in the concentration of phosphorylated CheY. Plots of the concentration of phosphorylated CheY over time (in seconds) for different growth rates k of ligand concentration. The larger the value of k, the steeper the initial drop in the concentration of phosphorylated CheY, and the faster that methylation returns the concentration of phosphorylated CheY to equilibrium. The same equilibrium is obtained regardless of the value of k. More importantly, the above figure further illustrates the robustness of bacterial chemotaxis to the rate of growth in ligand concentration. Whether the growth of the attractant is slow or fast, methylation will always bring the cell back to the same equilibrium concentration of phosphorylated CheY and therefore the same background tumbling frequency. Reversing the attractant gradient: And what if a cell is moving away from an attractant, down a concentration gradient? We would hope that the cell would be able to increase its tumbling frequency (i.e., increase the concentration of phosphorylated CheY) and then restore the background tumbling frequency by removing methylation. To simulate a decreasing gradient, we will model a cell in a high ligand concentration that is already at steady state, meaning that methylation is also elevated. In this case, the ligand concentration will decay exponentially, meaning that the ligand concentration is still given by the equation [L] = l0 · e^(k · t), but k is negative. STOP: If k is negative, how do you think that decreasing the value of k will affect the concentration of phosphorylated CheY over time? You may like to modify the previous tutorial on your own to account for traveling down an attractant gradient. If not, we are still happy to provide a separate tutorial below. Visit tutorial. Steady state tumbling frequency remains robust when traveling down an attractant gradient: The following figure plots the concentrations of molecules in our model as the concentration of attractant ligand decreases exponentially with l0 equal to 10^7 and k equal to -0.3. As the ligand concentration decreases, the concentration of bound ligand plummets as bound ligands dissociate and there are not enough free ligands to replace the dissociating ones. In the absence of ligand-receptor binding, CheY is free to phosphorylate, causing a spike in phosphorylated CheY. Demethylation of receptors then causes the concentration of phosphorylated CheY to steadily return to its equilibrium. Molecular concentrations (in number of molecules in the cell) over time (in seconds) for a simulated bacterium traveling down an attractant gradient with l0 = 10^7 and k = -0.3. Phosphorylated CheY follows the opposite pattern to traveling up an attractant gradient, with the concentration of phosphorylated CheY rising quickly, only to slowly decrease to equilibrium due to demethylation. To be thorough, we should also test the robustness of our model and see whether the CheY concentration will return to the same steady state for a variety of values of k when k is negative. As in the case of an increasing gradient, the figure below shows that the more sudden the change in the concentration of attractant (i.e., the more negative the value of k), the sharper the spike. 
And yet regardless of the value of k, methylation does its work to bring the concentration back to the same steady state, which has been confirmed by experimental observations.1 Varying values of k in our exponential decrease in the concentration of attractant ligand produce the same equilibrium concentration of phosphorylated CheY. The smaller the value of k, the steeper the initial spike, and the faster the recovery to steady state. From changing tumbling frequencies to an exploration algorithm: We hope that our work here has conveyed the elegance of bacterial chemotaxis, as well as the power of rule-based modeling and the Gillespie algorithm for simulating a complex biochemical system that may include a huge number of reactions. And yet we are missing an important part of the story. E. coli has evolved to ensure that if it detects a relative increase in concentration (i.e., an attractant gradient), then it can reduce its tumbling frequency in response. But we have not explored why changing its tumbling frequency would help a bacterium find food in the first place. After all, according to the run and tumble model, the direction that a bacterium is moving at any point in time is random! This quandary does not have an obvious intuitive answer. In this module’s conclusion, we will build a model to explain why E. coli’s randomized run and tumble walk is such a clever way of locating resources in an unfamiliar land. Next lesson. Krembel A., Colin R., Sourjik V. 2015. Importance of multiple methylation sites in Escherichia coli chemotaxis. Available online. "
     
   } ,
  
   {
     
        "title"    : "The Gray-Scott Model: A Turing Pattern Cellular Automaton",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/gray-scott",
        "date"     : "",
        "content"  : "Adding reactions to our diffusion automatonNow that we have established a cellular automaton for coarse-grained particle diffusion, we will add to it the three reactions that we introduced in the previous lesson, which are reproduced below.  A “feed” reaction in which new A particles are fed into the system at a constant rate.  A “death” reaction in which B particles are removed from the system at a rate proportional to their current concentration.  A “reproduction” reaction A + 2B  3B.STOP: How might we incorporate these reactions into our automaton?First, we have the feed reaction, which takes place at a rate f. It is tempting to simply add some constant value f to the concentration of A in each cell in each time step. However, if [A] were close to 1, then adding f to it could cause [A] to exceed 1, which we may wish to avoid.Instead, if a cell has current concentration [A], then we will add f(1-[A]) to this cell’s concentration of A particles. For example, if [A] is equal to 0.01, then we will add 0.99f to the cell because the current concentration is low. If [A] is equal to 0.8, then we will only add 0.2f to the current concentration of A particles.Second, we consider the death reaction of B particles, which takes place at rate k. Recall that k is proportional to the current concentration of B particles. As a result, we will subtract k · [B] from the current concentration of B particles.Third, we have the reproduction reaction A + 2B  3B, which takes place at a rate r. The higher the concentration of A and B, the more this reaction will take place. Furthermore, because we need two B particles in order for the collision to occur, the reaction should be more rare if we have a low concentration of B than if we have a low concentration of A. To model this situation, if a given cell is represented by the concentrations ([A], [B]), then we will subtract r · [A] · [B]2 from the concentration of A and add r · [A] · [B]2 to the concentration of B in the next time step.We now just need to combine these reactions with diffusion. Say that as the result of diffusion, the change in its concentrations are ΔA and ΔB, where a negative number represents particles leaving the cell, and a positive number represents particles entering the cell. Then in the next time step, the particle concentrations [A]new and [B]new are given by the following equations:[A]new = [A] + ΔA +  f(1-[A]) - r · [A] · [B]2;[B]new = [B] + ΔB - k · [B] + r · [A] · [B]2.Applying these reaction-diffusion computations over all cells in parallel and over many generations constitutes a cellular automaton called the Gray-Scott model.1Before continuing, let us consider an example of how a single cell might update its concentration of both particle types as a result of reaction and diffusion.  Say that we have the following hypothetical parameter values:  dA = 0.2;  dB = 0.1;  f = 0.3;  k = 0.4;  r = 1 (the value typically always used in the Gray-Scott model).Furthermore, say that our cell has the current concentrations ([A], [B]) = (0.7, 0.5). Then as a result of diffusion, the cell’s concentration of A will decrease by 0.7 · dA = 0.14, and its concentration of B will decrease by 0.5 · dB = 0.05. It will also receive particles from neighboring cells; for example, say that it receives an increase to its concentration of A by 0.08 and an increase to its concentration of B by 0.06 as the result of diffusion from neighbors. 
Before continuing, let us consider an example of how a single cell might update its concentration of both particle types as a result of reaction and diffusion. Say that we have the following hypothetical parameter values: dA = 0.2; dB = 0.1; f = 0.3; k = 0.4; r = 1 (the value typically used in the Gray-Scott model). Furthermore, say that our cell has the current concentrations ([A], [B]) = (0.7, 0.5). Then as a result of diffusion, the cell’s concentration of A will decrease by 0.7 · dA = 0.14, and its concentration of B will decrease by 0.5 · dB = 0.05. It will also receive particles from neighboring cells; for example, say that its concentration of A increases by 0.08 and its concentration of B increases by 0.06 as the result of diffusion from neighbors. Therefore, the net concentration changes due to diffusion are ΔA = 0.08 - 0.14 = -0.06, and ΔB = 0.06 - 0.05 = 0.01. Now we will consider the three reactions. The feed reaction will cause the cell’s concentration of A to increase by (1 - [A]) · f = 0.09. The death reaction will cause its concentration of B to decrease by k · [B] = 0.2. And the reproduction reaction will mean that the concentration of A decreases by [A] · [B]^2 = 0.175, with the concentration of B increasing by the same amount. As the result of all these processes, we update the concentrations of A and B to the following values ([A]new, [B]new) in the next time step according to our equations above: [A]new = 0.7 - 0.06 + 0.09 - 0.175 = 0.555; [B]new = 0.5 + 0.01 - 0.2 + 0.175 = 0.485. We are now ready to implement the Gray-Scott model in the following tutorial. The question is: even though we have built a coarser-grained simulation than in the previous lesson, will we still see Turing patterns? Visit tutorial. Reflection on the Gray-Scott model: In contrast to the particle-based simulator introduced earlier, the Gray-Scott model produced an animation in under a minute on a laptop. We show the results of this model in the videos that follow. Throughout these animations, we use the parameters dA = 1.0, dB = 0.5, and r = 1, and we color each cell according to its value of [B]/([A]+[B]) using the “Spectral” color map. Our first video shows an animation of the Gray-Scott model using the parameters f = 0.034 and k = 0.095. We use an initial configuration of the automaton comparable to the one in the diffusion example, in which a cluster of B particles is placed in a board full of A particles. If we expand the size of the simulation and add multiple clusters of B particles to the automaton, then the patterns become more complex as waves of B particles collide. If we keep the feed rate constant and increase the death rate slightly to k = 0.097, then the patterns change significantly into spots. If we make the A particles a little happier as well, increasing f to 0.038 and k to 0.099, then we have a different striped pattern. And if we increase f to 0.042 and k to 0.101, then again we see spots. The point is that very slight changes in our model’s parameters can produce drastically different results in the patterns that we witness. In this prologue’s conclusion, we will connect this observation back to our original motivation of identifying the cause of animal skin patterns. Next lesson. P. Gray and S.K. Scott, Autocatalytic reactions in the isothermal, continuous stirred tank reactor: isolas and other forms of multistability. Chemical Engineering Science 38, 29-43 (1983). "
     
   } ,
  
   {
     
        "title"    : "Module 1: Finding Motifs in Transcription Factor Networks",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/home",
        "date"     : "",
        "content"  : "Introduction: Networks Rule (Biology)In the prologue, we worked with a particle-based model that simulated the interactions of skin cells to produce complex Turing patterns. In this module, we will zoom into the molecular level and model protein interactions. The scale of these interactions is tiny: a protein is typically on the order of about 10nm in diameter. (For comparison, the diameter of a single human hair is about 100,000 nm, and a light microscope’s highest resolution is about 2,000 nm.) We will see that the cell has evolved a form of molecular communication based on protein interactions that is rapid, robust, and elegant.To model protein interactions, we will use a  network, which is a collection of nodes along with edges that connect pairs of nodes. Whether we are studying the interactions of proteins, the complex chains of chemical reactions underlying cellular metabolism, the tangled webs of neurons in the human nervous system, or an evolutionary tree of life, networks are critical to our understanding of biological processes.Our interest lies in the frequently recurring structures hidden within biological networks called network motifs. Similarly to our work in the prologue, we will use modeling to answer of why these motifs have evolved to help the cell respond to its environment.We will soon define our specific network of study, but before we get ahead of ourselves, we will introduce some molecular biology fundamentals that we will need to complete our analysis. You may already know this biological background, in which case you should feel free to skim the next lesson.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Module 3: Analyzing the Coronavirus Spike Protein",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/home",
        "date"     : "",
        "content"  : "Introduction: A Tale of Two Doctors  One of the world’s most important warning systems for a deadly new outbreak is a doctor’s or nurse’s recognition that some new disease is emerging and then sounding the alarm. It takes intelligence and courage to step up and say something like that, even in the best of circumstances.  Tom Inglesby 1, Director of the Center for Health Security at Johns Hopkins Bloomberg School of Public HealthThe world’s fastest outbreakOn February 21, 2003, a Chinese doctor named Liu Jianlun flew to Hong Kong to attend a wedding and checked into Room 911 of the Metropole Hotel. The next day, he became too ill to attend the wedding and was admitted to a hospital. Two weeks later, he was dead.On his deathbed, Dr. Liu stated that he had recently treated sick patients in Guangdong Province, where a highly contagious respiratory illness had infected hundreds of people. The Chinese government had made brief mention of this incident to the World Health Organization (WHO) but had concluded that the likely culprit was a common bacterial infection. By the time anyone realized the severity of the disease, it was already too late to stop the outbreak.On February 23, a man who had stayed across the hall from Dr. Liu at the Metropole traveled to Hanoi and died after infecting 80 people. On February 26, a woman checked out of the Metropole, traveled back to Toronto, and died after initiating an outbreak there. On March 1, a third guest was admitted to a hospital in Singapore, where sixteen additional cases of the illness arose within two weeks.23.Consider that the Black Death, which killed over a third of all Europeans, took four years to travel from Constantinople to Kiev.  Or that HIV took two decades to circle the globe. In contrast, this mysterious new disease had crossed the Pacific Ocean within a week of entering Hong Kong.As health officials braced for the impact of the fastest traveling virus in human history, panic set in. Businesses were closed, sick passengers were removed from airplanes, and Chinese officials threatened to execute anyone deliberately spreading the disease. In the process, the mysterious new illness earned a name: severe acute respiratory syndrome, or SARS.Tracing the source of the outbreakSARS was deadly, killing close to 10% of those who became sick.4 But it also struggled to spread within the human population, and it was contained in July 2003 after accumulating fewer than 10,000 confirmed symptomatic cases worldwide.In 2017, researchers published the results of sampling horseshoe bats for five years in a cave in Yunnan province. They found that these bats harbored coronaviruses with remarkable genetic similarity to the virus causing SARS. Yet their work has become infamous because they identified additional coronaviruses in the bats that were capable of entering human cells. Their words are now chilling:5  We have also revealed that various [viruses]  are still circulating among bats in this region. Thus, the risk of spillover into people and emergence of a disease similar to SARS is possible. This is particularly important given that the nearest village to the bat cave we surveyed is only 1.1 km away, which indicates a potential risk of exposure to bats for the local residents. 
Thus, we propose that monitoring of SARSr-CoV evolution at this and other sites should continue, as well as examination of human behavioral risk for infection and serological surveys of people, to determine if spillover is already occurring at these sites and to design intervention strategies to avoid future disease emergence.” A new threat emerges: On December 30, 2019, a Chinese ophthalmologist named Li Wenliang sent a WeChat message to fellow doctors at Wuhan Central Hospital, warning them that he had seen several patients with symptoms resembling SARS.1 He urged his colleagues to wear protective clothing and masks to shield them from this new threat. The next day, a screenshot of his post was leaked online, and local police summoned Dr. Li and forced him to sign a statement that he had “severely disturbed public order”. He then returned to work, treating patients in the same Wuhan hospital. Meanwhile, WHO received reports of multiple pneumonia cases from the Wuhan Municipal Health Commission and activated a support team to assess the new disease. WHO declared on January 14 that local authorities had seen “no clear evidence of human-to-human transmission of the novel coronavirus”. But once again, it was already too late. Throughout January, the virus silently raged through China as Lunar New Year celebrations took place within the country, and it spread to both South Korea and the United States. By the end of the month, the disease was in 19 countries, becoming a pandemic and earning a name in the process: coronavirus disease 2019 (COVID-19). As for Dr. Li? Despite warning against the risk of the new virus, he contracted COVID-19 from one of his patients on January 8. He continued working until he had to be admitted to the hospital on January 31. Within a week, he was dead, one of the first of millions of COVID-19 casualties. The sequence of the SARS-CoV-2 spike protein: The viruses causing the two outbreaks, SARS coronavirus (SARS-CoV) and SARS coronavirus 2 (SARS-CoV-2), are both coronaviruses, which means that their outer membranes are covered in a layer of spike proteins that cause them to look like the sun’s corona during an eclipse (see figure below). Coronaviruses as seen under a microscope. The fuzzy blobs on the cell surface are spike proteins, which the virus uses to gain entry to host cells. Figure courtesy F. Murphy and S. Whitfield, CDC.6 When viewed under a microscope, the two viruses look identical, and they use the same mechanism to infect human cells, in which the spike protein on the virus surface binds to the ACE2 enzyme on a human cell’s membrane.7,8 So why did SARS fizzle while SARS-CoV-2, which causes a disease that is on average less harmful9,10 and less deadly to individuals who contract it, transformed into a pandemic? The most likely explanation for the ability of SARS-CoV-2 to spread across far more countries and remain a public health threat even in the face of lockdowns is that it spreads more easily; that is, it is more infectious. Is there a molecular basis for this increased infectiousness? In this module, we will place ourselves in the shoes of early SARS-CoV-2 researchers studying the new virus in early 2020. The virus’s genome, consisting of nearly 30,000 nucleotides, was published on January 10,11,12 and an annotation of this genome identifying the locations of the virus’s genes is shown in the figure below. 
Upon sequence comparison, SARS-CoV-2 was found to be closely related to several coronaviruses isolated from bats and more distantly related to SARS-CoV. An annotated genome of SARS-CoV-2, with rectangles showing the locations of regions encoding RNA or protein. The spike protein, found at the bottom of this image, is labeled “S” and begins at nucleotide position 21,563. Accessed from GenBank: https://go.usa.gov/xfzMM. Recall from our discussion of transcription factors that by the central dogma of molecular biology, DNA is transcribed into RNA, which is then translated into protein. According to the genetic code, triplets of RNA nucleotides called codons are converted into single amino acids. The resulting chain of amino acids is called a polypeptide. Note: Coronaviruses are RNA viruses, which means that they do not have DNA and their genome is encoded as a single strand of RNA. As a result, they bypass the DNA-to-RNA transcription process. The gene encoding the spike protein starts at nucleotide position 21,563 of the SARS-CoV-2 genome, and the corresponding translated polypeptide chain is shown below. Each of the 20 standard amino acids is represented by a letter taken from the Latin alphabet (all letters except for “B”, “J”, “O”, “U”, “X”, and “Z” are used). As you examine this string of letters, consider how global mayhem can ultimately be caused by something so tiny. >YP_009724390.1 S [organism=Severe acute respiratory syndrome coronavirus 2] [GeneID=43740568] MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT Nature’s magic protein folding algorithm: After the SARS-CoV-2 spike protein polypeptide chain is formed, it will “fold” into a three-dimensional shape. This folding process occurs spontaneously for all proteins and without any outside influence, and the same polypeptide chain will almost always fold into the same 3-D structure in a matter of microseconds. Nature must be applying some “magic algorithm” to quickly produce the folded structure of a protein from its sequence of amino acids. Predicting the folded structure of a polypeptide is called protein structure prediction, a problem that is simple to state but difficult to solve. In fact, protein structure prediction has been an active area of biological research for several decades.
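As an aside, while folding is hard, the translation step itself is computationally trivial. Here is a hedged Biopython sketch (the local file name is our assumption, and the gene coordinates follow the GenBank annotation above):

from Bio import SeqIO

# Assumes a local FASTA copy of the Wuhan-Hu-1 genome (GenBank: MN908947);
# the file name here is hypothetical.
genome = SeqIO.read("MN908947.fasta", "fasta").seq

# The S gene begins at position 21,563 (1-based); Python slicing is 0-based
# and end-exclusive, and 25,384 is the annotated end of the gene.
spike_gene = genome[21562:25384]
spike_protein = spike_gene.translate(to_stop=True)  # codons -> amino acids
print(spike_protein[:24])  # MFVFLVLLPLVSSQCVNLTTRTQL, matching the sequence above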
Why do we care about protein structure? Knowing a protein’s structure is essential to determining its function and how it interacts with other proteins or molecules in its environment. After all, we still do not know the function of a few thousand human genes, and the figure below shows the huge variety of protein shapes among the 2020 “molecules of the month” named by the Protein Data Bank (PDB). (Note that the June 2020 winner is the SARS-CoV-2 spike protein.) Each “molecule of the month” in 2020 named by the PDB. These proteins have widely varying shapes and accomplish a wide variety of cellular tasks. The SARS-CoV-2 spike protein was the molecule of the month in June. Image courtesy: David Goodsell. For a more visual example of how protein structure affects protein function, consider the following video of a ribosome (which is a complex of RNA and proteins) translating a messenger RNA into protein. For translation to succeed, the ribosome needs to have a very precise shape, including a “slot” into which the messenger RNA strand can fit. Another central problem in protein structure research is understanding protein interactions. For example, a disease may be caused by a faulty protein, in which case researchers want to find a drug that binds to the protein and causes some change of interest in that protein, such as inhibiting its behavior. In this module, we will consider two questions. First, can we reverse engineer nature’s magic algorithm and determine the spike protein’s shape from its sequence of amino acids in the figure above? Second, once we know the molecular structure of the SARS-CoV-2 spike protein, how do its structure and function differ from those of the same protein in SARS-CoV? These two questions are both significant, and so we will split our work on them over two parts. If you are already familiar with protein structure prediction, then you may want to skip ahead to the second part of the module, in which we discuss differences between the spike proteins of the two viruses. Continue to part 1: structure prediction. Jump to part 2: spike protein comparisons. Green, A. (2020, February 18). Li Wenliang. The Lancet 395(10225), P682. https://doi.org/10.1016/S0140-6736(20)30382-2 Hung L. S. 2003. The SARS epidemic in Hong Kong: what lessons have we learned? Journal of the Royal Society of Medicine 96(8), 374–378. https://doi.org/10.1258/jrsm.96.8.374 Update 95 - SARS: Chronology of a serial killer. (2015, July 24). Retrieved August 17, 2020, from https://www.who.int/csr/don/2003_07_04/en/ CDC SARS Response Timeline. 2013, April 26. Retrieved August 17, 2020, from https://www.cdc.gov/about/history/sars/timeline.htm Hu, B., Zeng, L., Yang, X., Ge, X., Zhang, W., Li, B., Xie, J., Shen, X., Zhang, Y., Wang, N., Luo, D., Zheng, X., Wang, M., Daszak, P., Wang, L., Cui, J., Shi, Z. 2017. Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus. PLOS Pathogens 13(11). doi:10.1371/journal.ppat.1006698 Murphy, F., Whitfield, S. 1975. ID#: 10270. Public Health Image Library, CDC. https://phil.cdc.gov/Details.aspx?pid=10270 Shang, J., Ye, G., Shi, K., Wan, Y., Luo, C., Aihara, H., Geng, Q., Auerbach, A., Li, F. 2020. Structural basis of receptor recognition by SARS-CoV-2. Nature 581, 221-224. Li, F., Li, W., Farzan, M., Harrison, S. C. 2005. 
Structure of SARS Coronavirus Spike Receptor-Binding Domain Complexed with Receptor. Science 309, 1864-1868. Q&A on coronaviruses (COVID-19). (2020, April 17). https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub/q-a-detail/q-a-coronaviruses Paules C.I., Marston H.D., Fauci A.S. 2020. Coronavirus Infections—More Than Just the Common Cold. JAMA 323(8), 707–708. doi:10.1001/jama.2020.0757 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. https://www.ncbi.nlm.nih.gov/nuccore/MN908947 Annotated Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. https://go.usa.gov/xfzMM "
     
   } ,
  
   {
     
        "title"    : "Module 4: Training a Computer to Classify White Blood Cells",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/home",
        "date"     : "",
        "content"  : "Introduction: How Are Blood Cells Counted?Your doctor sometimes counts your blood cells to ensure that they are within healthy ranges as part of a complete blood count. These cells comprise red blood cells (RBCs), which transport oxygen via the hemoglobin protein, and white blood cells (WBCs), which help identify and attack foreign cells as part of your immune system.The classic device used for counting blood cells is the hemocytometer. As illustrated in the video below, a technician filters a small amount of blood onto a gridded slide and then counts the number of cells of each type in the squares on the grid. As a result, the technician can estimate the number of each type of cell per volume of blood.      STOP: How could the volume of a blood sample influence the estimate of blood cell count?You would not be wrong to think that the hemocytometer seems old-fashioned, as it was invented by Louis-Charles Malassez 150 years ago. To reduce the human error inherent in using this device, can we train a computer to count blood cells for us?In this module, we will focus our work on WBCs, which divide into families based on their structure and function, with some diseases causing an abnormally low or high count of cells in a specific family.  We therefore have two aims. First, can we excise, or segment, WBCs from cellular images? Second, can we devise an algorithm to classify WBCs by family?We will work with a publicly available dataset containing blood cell images that depict both RBCs and WBCs, as shown in the figure below. The cells have been stained with a red-orange dye that bonds to hemoglobin and a blue dye that bonds to DNA and RNA. RBCs lack a nucleus and will only absorb the red-orange dye, whereas WBCs have a nucleus but lack hemoglobin and will only absorb the blue dye.The figure below also illustrates the three main families of WBCs: granulocytes, monocytes, and lymphocytes.  Granulocytes have a multilobular nucleus, which consists of several “lobes” that are linked by thin strands of nuclear material. Monocyte and lymphocyte nuclei only have a single lobe, but monocyte nuclei have a more irregular shape, whereas lymphocyte nuclei are more rounded and take up a greater fraction of the cell’s volume.                                                                                                        Three images from the blood cell image dataset showing three types of WBCs. In our dataset, these cells correspond to image IDs 3, 15, and 20. (Left) A specific subtype of granulocyte called a neutrophil, illustrating the multilobular structure of this WBC family. (Center) A monocyte with a single, irregularly-shaped nucleus. (Right) A lymphocyte with a small, round nucleus.  When you look at the cells in the figure above, you may think that our work will be easy. Segmenting WBC nuclei only requires identifying the large purplish regions, and classifying them is just a matter of categorizing them according to the differences in nuclear shape that we described above. Yet even after decades of research into computer vision, researchers struggle to attain the precision of the human eye, a biological apparatus that is the result of billions of years of evolution.Next lesson"
     
   } ,
  
   {
     
        "title"    : "Prologue: Random Walks and Turing Patterns",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/",
        "date"     : "",
        "content"  : "Introduction: Alan Turing and the Zebra’s StripesIf you are familiar with Alan Turing, then you might be surprised that a famous computer scientist would appear in the first sentence of a book on biological modeling. After all, he is best known for two achievements that have nothing to do with biology. In 1936, Turing theorized a primitive computer consisting of a tape of cells along with a machine that writes symbols on the tape according to a set of predetermined set of rules; this “Turing machine” 1 is simple but nevertheless has been proven capable of solving any problem that a modern computer can solve. Then, during World War II, Turing worked with Allied cryptographers at Bletchley Park to devise machines that broke several German ciphers.Alan Turing in 1951. © National Portrait Gallery, London.Yet in 1952, two years before his untimely demise, Turing published his only paper on biochemistry, which addressed the question: “Why do zebras have stripes?”2 He was not asking why zebras have evolved to have stripes  this question was unsolved in Turing’s time, and recent research has indicated that the stripes may be helpful in warding off flies. Rather, Turing was interested in what biochemical mechanism could produce the stripes that we see on a zebra’s coat. And he reasoned that just as a simple machine can emulate a computer, some limited set of molecular “rules” could cause stripes to appear on a zebra’s coat.In this prologue, we will introduce a particle simulation model based on Turing’s ideas. We will be see that a system built on very simple rules and even randomness can nevertheless produce emergent behavior that is complex and elegant. And we will explore how this model can be tweaked to provide a hypothesis for the origin of not just the zebra’s stripes but also the leopard’s spots.Next lesson            Turing, Alan M. (1936), “On Computable Numbers, with an Application to the Entscheidungsproblem”, Proceedings of the London Mathematical Society, Ser. 2, Vol. 42: 230-265. &#8617;              Turing, Alan (1952). “The Chemical Basis of Morphogenesis” (PDF). Philosophical Transactions of the Royal Society of London B. 237 (641): 37–72. Bibcode:1952RSPTB.237…37T. doi:10.1098/rstb.1952.0012. JSTOR 92463. &#8617;      "
     
   } ,
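To make the idea of “simple rules plus randomness” concrete before the MCell and CellBlender tutorials, here is a minimal sketch of the random walk that gives the prologue its name. The step rule and step counts are arbitrary choices for illustration, not the model built in the tutorials.

```python
import random

def random_walk_2d(num_steps, step=1.0):
    """One particle taking unit steps in random lattice directions."""
    x = y = 0.0
    for _ in range(num_steps):
        dx, dy = random.choice([(step, 0), (-step, 0), (0, step), (0, -step)])
        x, y = x + dx, y + dy
    return x, y

# A hallmark of diffusion: mean squared displacement grows linearly with step count.
finals = [random_walk_2d(1000) for _ in range(500)]
mean_sq = sum(x * x + y * y for x, y in finals) / len(finals)
print("mean squared displacement:", mean_sq)  # expect roughly 1000
```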
  
   {
     
        "title"    : "Module 2: Unpacking E. coli’s Genius Exploration Algorithm",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/home",
        "date"     : "",
        "content"  : "The lost immortalsThe book What If?1, by Randall Munroe, compiles a collection of crazy scientific hypotheticals, paired with thorough discussions of what might happen if these situations occurred. One such hypothetical, called “Lost Immortals”, ponders how two immortal humans might find each other if they were stranded in different locations of an uninhabited planet.We could imagine many ideas for how the immortals could reunite. For example, they could avoid the interiors of continents by moving to the coastlines. If they are allowed to discuss how to find each other in advance, then they could agree to meet at the planet’s North Pole  assuming that the planet lacks polar bears.But Munroe provides a solution that is both sophisticated and elegant. He argues that without additional information, the immortals should walk randomly, leaving markers in their wake pointing in the direction that they travel, and resting frequently. If one immortal finds the other’s trail, then they should follow it, resting less and traveling faster, until some time has expired or they lose the trail.In the previous two modules, we have harnessed the power of randomness to answer to practical questions. Munroe’s approach exemplifies a randomized algorithm, or a method that uses randomness to solve a problem.In fact, Munroe’s randomized algorithm is inspired by nature; he calls his approach “be an ant” because it mimics how ants explore their environment for resources.  However, in this module, we will see that the lost immortals’ algorithm is also similar to the method of exploration taken by a much smaller organism: our old friend E. coli.Like other prokaryotes, E. coli is tiny, with a rod-shaped body that is 2µm long and 0.25 to 1µm wide.2 In exploring a vast world with sparse resources, E. coli finds itself in a situation comparable to that of the lost immortals.The video below shows a collection of E. coli surrounding a sugar crystal. Think of this video the next time you leave a slice of cake out overnight on the kitchen counter!      The movement of organisms like the bacteria in the above video in response to a chemical stimulus is called chemotaxis. E. coli and other bacteria have evolved to move toward attractants like glucose and electron acceptors and move away from repellents like Ni2+ and Co2+.In this module, we will delve into chemotaxis and ask a number of questions. How does a simple organism like E. coli sense an attractant or repellent in its environment? How does the bacterium change its internal state accordingly? How can we model the bacterium’s response? And how does the bacterium’s behavior translate into an “algorithm” that it uses to explore its environment?Next lesson            Randall Munroe. What If? Available online &#8617;              Pierucci O. 1978. Dimensions of Escherichia coli at various growth rates: Model of envelope growth. Journal of Bacteriology 135(2):559-574. Available online &#8617;      "
     
   } ,
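As a preview of the exploration strategy this module unpacks, here is a deliberately crude caricature in code: run in a straight line while the attractant reading improves, and pick a random new direction when it worsens. The concentration field, step sizes, and decision rule are all invented for illustration and are far simpler than real chemotaxis.

```python
import math, random

def run_and_tumble(concentration, steps=500, run_speed=1.0):
    """Run while the attractant reading improves; tumble (pick a random
    direction) when it gets worse."""
    x = y = 0.0
    angle = random.uniform(0, 2 * math.pi)
    last = concentration(x, y)
    for _ in range(steps):
        x += run_speed * math.cos(angle)
        y += run_speed * math.sin(angle)
        now = concentration(x, y)
        if now < last:                        # reading worsened: tumble
            angle = random.uniform(0, 2 * math.pi)
        last = now
    return x, y

# A made-up attractant field that peaks at (50, 50).
concentration = lambda x, y: -math.hypot(x - 50.0, y - 50.0)
print("final position:", run_and_tumble(concentration))  # usually ends near the peak
```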
  
   {
     
        "title"    : "Homology Modeling",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/homology",
        "date"     : "",
        "content"  : "Homology modeling finds a similar protein structureIn the previous lesson, we saw that ab initio structure prediction of a long protein like the SARS-CoV-2 spike protein can be time consuming and error prone. As we mentioned in the introduction to structure prediction, however, researchers have entered over 160,000 experimentally verified structure entries into the PDB. With every new structure that we identify, we gain a little more insight into nature’s magic protein folding algorithm. In homology modeling (also called comparative modeling), we use the information contained in known structures to help us predict the structure of a protein with unknown structure.The structure of the SARS-CoV spike protein was determined in 2003. Assuming that the two proteins have similar structure, we will use SARS-CoV spike protein’s known structure as a guide to help us predict the structure of the SARS-CoV-2 spike protein. In other words, we restrict the search space to those structures that are similar to the SARS-CoV spike protein structure.STOP: In the case of the SARS-CoV-2 spike protein, we already know that we want to use the SARS-CoV spike protein as a template. However, if we do not know which template to use before we begin, how could we find a candidate protein template?A similar structure reduces the size of the search spaceOnce we have found a protein with potentially similar structure to our protein of interest, we need to use this known structure to predict the structure of our protein. One way of doing so is to include a “similarity term” in the energy function that subtracts a structure’s similarity to the template structure from the structure’s total energy. That is, the more similar that a candidate structure is to the template, the more negative the contribution of this similarity term. To continue our search space analogy, the template protein “pulls down” the energy values of nearby structures like a gravity well.Another way to perform homology modeling is to account for variance in similarity across different regions of the two proteins. The SARS-CoV and SARS-CoV-2 genomes are 96% similar, but their are only 76% similar. In general, when we examine genomes from related species, we see conserved regions where the species are very similar and variable regions where the species are more different than the average.The phenomenon of conserved and variable regions also occurs within individual genes. The following figure shows that within a spike protein subunit, the S2 domain is 90% similar between the two viruses, whereas the S1 domain is only 64% similar. Furthermore, the S1 domain divides into two subunits of differing similarity.Variable and conserved regions in the SARS-CoV and SARS-CoV-2 spike proteins. The S1 domain tends to be more variable, whereas the S2 domain is more conserved. In this figure, “NTD” stands for “N-terminal domain” and “RBD” stands for “receptor binding domain”, two subunits of the S1 domain.Some homology modeling algorithms account for variable and conserved regions by assuming that very conserved regions in the two genes correspond to essentially identical structures in the proteins. That is, the structure of our protein of interest in these regions will be the same as those of the template protein. We can then use a fragment library, a catalog of known substructures from many proteins, to fill in the structure of non-conserved regions based on structures of fragments whose sequence is similar to these regions. 
This approach is called fragment assembly.We will model the SARS-CoV-2 spike protein using homology modeling software from three publicly available fragment assembly servers (SWISS-MODEL, Robetta, and GalaxyWEB). If the results are similar, then we have faith in the robustness of our predictions when using different approaches. Furthermore, comparing the results of multiple different approaches may give us more insights into structure prediction. If you are not interested in following this tutorial, links to the results can be found in the table below.Visit tutorial            Structure Prediction Server      Results                  SWISS-MODEL (S protein)      SWISS-MODEL Results              Robetta (Single-Chain S protein)      Robetta Results              GalaxyWEB      GalaxyWEB Results      Experiments determine the structure of the SARS-CoV-2 spike proteinOn February 25, 2020, two months after the publication of the SARS-CoV-2 genome, researchers from the Seattle Structural Genomics Center for Infectious Disease published the result of a cryo-EM experiment for the SARS-CoV-2 spike protein, which became PDB entry 6vxx.Note: If you would like to explore the structure of the SARS-CoV-2 spike protein, check out the 3-D protein viewer at https://www.rcsb.org/3d-view/6vxx.We now turn to the problem of comparing our homology modeling results of the SARS-CoV-2 spike protein against the experimentally verified structure of SARS-CoV-2. How to compare two protein structures is a simple question, but it has a complicated answer, to which we will devote an entire section.Next lesson"
     
   } ,
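The “similarity term” idea can be written down in a few lines. The sketch below is schematic: biased_energy, physical_energy, and similarity are hypothetical stand-ins, since real homology modeling energy functions are far more elaborate than this.

```python
def biased_energy(candidate, template, physical_energy, similarity, weight=1.0):
    """Lower is better. Subtracting similarity to the template 'pulls down'
    the energies of candidate structures near the template, like a gravity well."""
    return physical_energy(candidate) - weight * similarity(candidate, template)

# Toy usage with structures represented as plain lists of coordinates:
phys = lambda s: sum(v * v for v in s)                        # fake physical energy
sim = lambda s, t: -sum((a - b) ** 2 for a, b in zip(s, t))   # closer => larger value
print(biased_energy([1.0, 2.0], [1.1, 2.1], phys, sim, weight=0.5))
```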
  
   {
     
        "title"    : "&lt;em&gt;Biological Modeling&lt;/em&gt;",
        "category" : "",
        "tags"     : "",
        "url"      : "/",
        "date"     : "",
        "content"  : "Welcome to Biological Modeling!Have you ever wondered why zebras have stripes? Have you ever wondered how your cells can quickly react to their environment and perform complex tasks without intelligence? Have you ever wondered why the original SARS coronavirus fizzled out but SARS-CoV-2 has spread like wildfire around the planet? Have you ever wondered how  algorithms can be trained to “see” cells as well as a human?What these questions share is that they can start to be answered by modeling biological systems at multiple “scales” of resolution, from the microscopic to the molecular.In this free course, we will build models of biological systems that are relatively simple but nevertheless provide us with deep, fascinating insights into how those systems operate.Course structure and contentsThis online course is divided into a prologue and four main modules. Each of the five parts of the course covers a single biological topic (e.g., “Analyzing the structure of the SARS-CoV-2 spike protein”) and introduces all of the modeling topics needed to address that topic from first principles. The modules build on each other, so we suggest covering them in order, although it is possible to complete them out of order.Each module has a main narrative that can be explored by anyone, including beginner modelers; this main narrative will form our upcoming book. When we need to build a model along the way, we pass our modeling work to “software tutorials” that show how to use high-powered modeling software produced by MMBioS researchers to build biological models. The software tutorials allow users wishing to get their hands dirty with modeling software to build all of the models that we need in this course. This allows for a course that can be explored by both casual and serious biological modeling learners alike.After building a model in a software tutorial, we will return to the main text and analyze the results of this model. In this way, the text forms a constant interplay between establishing a biological problem, describing how a model work, implementing that model in a software tutorial, and returning to the text to analyze the model and ask our next question, beginning the cycle anew.Our course contents are found below.      Prologue: Random walks and Turing patterns (with software tutorials featuring MCell and CellBlender)        Module 1: Finding motifs in transcription factor networks (with software tutorials featuring MCell and CellBlender)        Module 2: Unpacking E. coli’s genius exploration algorithm (with software tutorials featuring BioNetGen)        Module 3: Analyzing the coronavirus spike protein (with software tutorials featuring ProDy and affiliated tools)        Module 4: Training a computer to classify white blood cells (with software tutorials featuring CellOrganizer)  Meet the teamThis course was lovingly put together by a professor and a team of wonderful students in Carnegie Mellon University’s Computational Biology Department. You can meet us on our Meet the Team page.Get the bookBiological Modeling has an official textbook companion, called Biological Modeling: A Short Tour.  Print book: Get it from Amazon!  
E-book: Get it from Leanpub!The publication of Biological Modeling: A Short Tour was graciously funded by a community of backers at Kickstarter and Indiegogo.Course survey and contact formPlease use our anonymous survey so that we can track information about the demographics of our learners.Whether you loved our course and would like to provide a testimonial, or you’re an instructor interested in adopting this material in your class, or you just want to say hi, then we would love to hear from you. Please use our contact page to get in touch!AcknowledgementsThis online course is a training and dissemination effort for the National Center for Multiscale Modeling of Biological Systems (MMBioS). It is supported by the National Institutes of Health (grant ID: P41 GM103712).We would first and foremost like to thank everyone working on MMBioS software; their work allowed this project to come about. Chiefly, thank you to the other members of our training and dissemination team (Alex Ropelewski, Joe Ayoob, and Rozita Laghaei) as well as the head of the MMBioS consortium, Jim Faeder.We are also very grateful to Wendy Velasquez Ebanks, Julien Gomez, Yanjing Li, Ulani Qi, Aditya Parekh, and Shalini Panthangi, who provided additional work on the course during its conception.Module 1 was in part inspired by Uri Alon’s research and superlative book An Introduction to Systems Biology, a landmark biological textbook that we strongly recommend if you are interested in a greater discussion of biological network motifs.Special thanks to Jiayi Shou for the analogy in Module 3 of new protein companies rising like “bamboo shoots after the rain”.The cover image on Module 4 was created by Keith Chambers.Finally, the website design was built using Michael Rose’s excellent Minimal Mistakes theme.You might also enjoy…If you like this course, then we would suggest some additional resources below.Additional open educational materials in computational biology and programmingWe think you would love some of the other free education projects developed by the project founder.  We list these resources below.      Bioinformatics Algorithms: An Active Learning Approach: A best-seller in its field, this textbook has been adopted by over 170 instructors in 40 countries around the world. It has also been used as the basis of the Bioinformatics Specialization on Coursera, which has reached hundreds of thousands of online learners. The first several chapters of the book is available for free on the textbook website.        Rosalind: An open platform for learning bioinformatics independently through problem solving.        Programming for Lovers: An open course in introductory programming aimed at science students. The course is still in development.  MMBioS Training WorkshopsIf you are a biological modeling researcher and want to learn more about how the software resources presented here can be applied to your work, please check out the workshops organized as part of the MMBioS project to which this course belongs.Another textbook on biological modelingA colleague at Carnegie Mellon University, Russell Schwartz, authored a textbook called Biological Modeling and Simulation that you may enjoy. Dr. Schwartz’s book focuses on a different collection of modeling topics than this work."
     
   } ,
  
   {
     
        "title"    : "Meet the Team",
        "category" : "",
        "tags"     : "",
        "url"      : "/meet-the-team/",
        "date"     : "",
        "content"  : "Meet the Team                          Phillip Compeau        Founder and Director         Phillip Compeau is an Associate Teaching Professor and the Assistant Department Head in the Computational Biology Department at Carnegie Mellon University. He directs the undergraduate program in computational biology and co-directs the Precollege Program in Computational Biology, both of which he co-founded.Phillip is passionate about open online education, and his education projects have reached hundreds of thousands of learners around the world. He is the co-author of Bioinformatics Algorithms: An Active Learning Approach, which has been adopted by 200 instructors in 40 countries, and which powers the popular Bioinformatics Specialization on Coursera. He co-founded the learning platform Rosalind for learning programming, bioinformatics, and algorithms through independent problem solving. He founded Programming for Lovers, an online course in introductory programming motivated by fun scientific applications.        Home Page                                      Noah Yann Lee        Web Designer &amp; Content Developer        Noah Yann Lee is a PhD student at Yale University under the Computational Biology and Bioinfornatics program. Noah completed his undergraduate at Carnegie Mellon University, graduating in 2020 with a B.S. in Computational Biology with a minor in Design for Learning. From running early-childhood educational tests with the Children’s School at Carnegie Mellon for the Global Learning XPRIZE, to cultivating and sequencing phage genomes with the PhageHunters program, Noah has an appreciation for science from the micro to the macro, physical to the digital. Noah is always interested to connect with projects and organizations working with STEM, education, and science outreach.                                      Chris Lee        Content Developer        Chris Lee is a current graduate student at Carnegie Mellon University and is in the M.S. in Computational Biology Program. Previously, he was an undergraduate student at Rutgers University and worked as an undergraduate researcher studying hydrothermal vent bacteria. In 2019, Chris graduated magna cum laude with a B.A. in Molecular Biology &amp; Biochemistry and double minor in Chemistry and Computer Science. He is currently interested in the fields of bioinformatics and genomics.                                      Shuanger Li        Content Developer        Shuanger is an MSc student studying Computational Biology at CMU. She is interested in theories of evolution and ecology, and is currently working with Dr. Oana Carja on heritable phenotypic variability. She enjoys modeling and simulation as powerful and fun ways to understand biological systems. She double majored in Environmental Sciences and Microbial Biology at UC Berkeley, where she studied Hawaiian arthropod assemblages, spider behaviors, and remediation bioreactors.                                      Mert Inan        Content Developer         Mert is currently a computer science Ph.D. student at the University of Pittsburgh. Mert is an alum of the M.S. in computational biology program at Carnegie Mellon University. He loves interdisciplinary fields and has been working at the intersection of computation, biology, neuroscience, and machine intelligence. Unlocking the secrets of biology is a pleasure that Mert truly enjoys, even under quarantine conditions.                                      
Nicole Matamala        Content Developer        Nicole Matamala is an alum of the B.S. in computational biology program at Carnegie Mellon University.            "
     
   } ,
  
   {
     
        "title"    : "Finding Local Differences in Protein Structures with Qres",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/multiseq",
        "date"     : "",
        "content"  : "In part 1 of this module, we used a variety of existing software resources to predict the structure of the SARS-CoV-2 spike protein from its amino acid sequence. We then discussed how to compare our predicted structures against the experimentally confirmed structure of the protein.Now begins part 2, in which we examine how the structure of the SARS-CoV-2 spike protein compares against the SARS-CoV spike protein. In keeping with the biological maxim that the structure of a protein informs the function of that protein, can we find any clues lurking in the spike proteins’ structure that would indicate why the two viruses had different fates?Focusing on a variable region of interest in the spike proteinIn part 1, we saw that the spike protein is much more variable than other regions of the coronavirus genome. We even see variable and conserved regions within the spike protein, as the following figure (reproduced from the section on homology modeling) indicates.Variable and conserved regions in the SARS-CoV and SARS-CoV-2 spike proteins. The S1 domain tends to be more variable, whereas the S2 domain is more conserved.One variable spike protein region is the receptor binding motif (RBM), part of the receptor binding domain (RBD), whose structure we predicted using GalaxyWEB in the homology modeling tutorial. The RBM mediates contact with ACE2, as the following simplified animation of the process illustrates.      The fact that the region binding to the target human enzyme has mutated so much makes it a fascinating region of study. Do the mutations that SARS-CoV-2 has accumulated somehow make it easier for the virus to infect human cells?The figure below shows an alignment of the 70 amino acid long RBM region from SARS-CoV and SARS-CoV-2.An alignment of the RBM of the human SARS-CoV virus (first row) and the SARS-CoV-2 virus (second row). Amino acids that are highlighted in green represent matches between the two RBM sequences. Beneath each column, a bar illlustrates conservation between the two sequences, where full conservation indicates a match and partial conservation indicates a mismatch.We already know from our previous work in this module that just because the sequence of a protein has been greatly mutated does not mean that the structure of that protein has changed much. Therefore, in this lesson, we will start a structural comparison of the SARS-CoV and SARS-CoV-2 spike proteins. All of this analysis will be performed using the software resources ProDy and VMD.In addition to verifying the structure of the spike protein in both SARS-CoV and SARS-CoV-2, researchers also determined the structure of the RBD complexed with ACE2 in both SARS-CoV (PDB entry: 2ajf) and SARS-CoV-2 (PDB entry: 6vw1).Because we know the structures of the bound complexes, we can produce 3-D visualizations of the two different complexes and look for structural differences involving the RBM. We will use VMD to produce this visualization, rotating the structures around to examine potential differences. However, we should be wary of only trusting our eyes to guide us; how can a quantitative approach tell us where to look for structural differences between the two RBMs?Note: The experimentally verified SARS-CoV-2 structure is a chimeric protein formed of the SARS-CoV RBD in which the RBM has the sequence from SARS-CoV-2 1. 
A chimeric RBD was used for complex technical reasons to ensure that the crystallization process during X-ray crystallography could be borrowed from that used for SARS-CoV.Contact maps visualize global structural differencesRecall from part 1 the definition of RMSD for two protein structures s and t, in which each structure is represented by the positions of its n alpha carbons (s1, …, sn) and (t1, …, tn).\[\text{RMSD}(s, t) = \sqrt{\dfrac{1}{n} \cdot (d(s_1, t_1)^2 + d(s_2, t_2)^2 + \cdots + d(s_n, t_n)^2)}\]If two similar protein structures differ in a few locations, then the corresponding alpha carbon distances d(si, ti) will likely be higher at these locations. However, one of the weaknesses of RMSD that we pointed out in part 1 of this module is that a change to a single bond angle at the i-th position may cause d(sj, tj) to be nonzero when j &gt; i, even though the structure of the protein downstream of this bond angle has not changed. For example, when we discussed the Kabsch algorithm, we showed the figure below of two protein structures that are identical except for a single bond angle. All of the alpha carbon distances d(si, ti) for i at least equal to 4 will be thrown off by this changed angle.Two toy protein structures in which the bond angle between the third and fourth alpha carbon has been changed. This change does not affect the distance between the i-th and j-th alpha carbons when i and j are both at least equal to 4.However, note that when i and j are both at least equal to 4, the distance d(si, sj) between the i-th and j-th alpha carbons in S will still be similar to the distance d(ti, tj) between the same alpha carbons in T. This observation leads us to a more rigorous approach for measuring differences in two protein structures, which compares all pairwise intraprotein distances d(si, sj) in one protein structure against the corresponding distances d(ti, tj) in the other structure.To help us visualize all these pairwise distances, we will introduce the contact map of a protein structure s, which is a binary matrix indicating whether two alpha carbons are near each other in s. After setting a threshold distance, we set M(i, j) = 1 if the distance d(si, sj) is less than the threshold, and we set M(i, j) = 0 if d(si, sj) is greater than or equal to the threshold.The figure below illustrates the contact maps for both full proteins and single chains of the SARS-CoV-2 and SARS-CoV spike proteins, using a threshold distance of twenty angstroms. We color each contact map cell black if it is equal to 1 (corresponding to close amino acid pairs) and white if it is equal to 0 (corresponding to distant amino acid pairs).Note: Interested in learning how to make contact maps? We will use ProDy to do so in a later section.We observe two facts about these contact maps. First, many black values cluster around the main diagonal of the matrix, since amino acids that are nearby in a protein’s sequence will remain near each other in the protein’s structure. Second, the contact maps for the two proteins are very similar, reinforcing that the two proteins have similar structures. Contact map regions that differ provide regions for further investigation when comparing two proteins structurally.The contact maps of the SARS-CoV-2 spike protein (top left), SARS-CoV spike protein (top right), single chain of the SARS-CoV-2 spike protein (bottom left), and single chain of the SARS-CoV spike protein (bottom right). 
If the distance between the i-th and j-th amino acids in a protein structure is 20.0 angstroms or less, then the (i, j)-th cell of the figure is colored black. The SARS-CoV-2 and SARS spike proteins have very similar contact maps, indicating that they have similar structures.STOP: How do you think a contact map will change as we increase or decrease the threshold distance used to produce that map?Qres measures local structural differencesWe obtain some insight into how two proteins differ structurally at the i-th amino acid if we examine the values in the i-th row of the proteins’ contact maps; that is, if we compare all of the d(si, sj) values to all of the d(ti, tj) values. In practice, researchers combine all of this information into a single metric called Q per residue (Qres) measuring the similarity of two structures at the i-th amino acid.  The formal definition of Qres for two structures s and t having N amino acids is2\[Q_{res}^{(i)} = \dfrac{1}{N-k} \sum^{N}_{j\neq i-1,i,i+1} \textrm{exp}[-\dfrac{[d(s_i,s_j)-d(t_i,t_j)]^2}{2\sigma^2_{i,j}}]\, .\]In this equation, exp(x) denotes ex. This equation also includes the following parameters.  k is equal to 2 at either the start or the end of the protein (i.e., i is equal to 1 or N), and k is equal to 3 otherwise.  The variance term \(\sigma_{ij}^2\) is equal to \(\left\lvert{i-j}\right\rvert ^{0.15}\), which corresponds to the sequence separation between the i-th and j-th alpha carbons.Note: The above definition assumes that the two proteins have the same length or have been pre-processed by removing amino acids that only occur in one protein. Generalizations of Qres for proteins of non-equal length first align the sequences of two proteins and retain only those amino acids for structural comparison that are shared by the two proteins.If two proteins are very similar at the i-th alpha carbon, then for every j, the difference d(si, sj) - d(ti, tj) will be close to zero, meaning that each term inside the summation in the Qres equation will be close to 1. The sum will be equal to approximately N - k, and so Qres will be close to 1. As two proteins become more different at the i-th alpha carbon, then the term inside the summation will head toward zero, and so will the value of Qres.Qres is therefore a metric of similarity ranging between 0 and 1. Low Qres scores indicate that two proteins differ structurally at the i-th position, and high scores indicate that the two proteins are similar structurally at this position.We now will compute Qres for the SARS-CoV and SARS-CoV-2 spike proteins using the VMD plugin Multiseq, a bioinformatics analysis environment. After determining Qres, we will visualize the individual locations where the two RBD regions differ.Visit tutorialLocal comparison of spike proteins leads us to a region of interestBy computing Qres at every position of the two coronavirus RBD regions, we can form a structural alignment of the two regions, as shown in the figure below. Blue columns correspond to amino acids with high Qres (meaning high structural similarity), and red columns correspond to amino acids with low Qres (meaning low structural similarity). If we zoom in on the region around position 150 of the alignment, we find a 13-column region of the alignment within the RBD region for which Qres values are significantly lower than they are elsewhere. 
This region corresponds to positions 476 to 485 in the SARS-CoV-2 spike protein, which is part of the RBM.A snapshot of the sequence alignment between the SARS-CoV RBD (first row) and the SARS-CoV-2 chimeric RBD1 (second row). Columns are colored along a spectrum from blue (high Qres) to red (low Qres), with positions that correspond to an inserted or deleted amino acid colored red. The region with low Qres corresponds to amino acids at positions 476 to 485 in the SARS-CoV-2 spike protein.The figure below shows a 3-D visualization of the ACE2 enzyme (green) bound with the superimposed structures of both the SARS-CoV and SARS-CoV-2 RBD. The same color-coding of columns of the multiple alignment in the figure above is used to color positions in the superimposed RBDs. The low-Qres region of the RBM alignment that we highlighted in the above figure is outlined in the figure below.A visualization showing the superimposed structures of the SARS-CoV-2 chimeric RBD1  and SARS-CoV RBD, with individual amino acids colored blue or red depending on whether Qres is high or low, respectively.  The ACE2 enzyme is shown in green. The boxed region corresponds to the part of the RBM having a potential structural difference. Because this region is adjacent to ACE2, the structural difference will likely affect ACE2 interactions.Note: Although the rest of the proteins are similar, there are other parts of the RBD at the top of the protein that show dissimilarities in the two proteins, which may be attributable to an experimental artifact.Finding a region in the RBM where the structures of the SARS-CoV and SARS-CoV-2 spike proteins differ presents an exciting development.  We will next explore this region of the protein structure to determine whether the mutations acquired by SARS-CoV-2 may have influenced the binding affinity of the spike protein with the human ACE2 enzyme.Next lesson            Shang, J., Ye, G., Shi, K., Wan, Y., Luo, C., Aijara, H., Geng, Q., Auerbach, A., Li, F. 2020. Structural basis of receptor recognition by SARS-CoV-2. Nature 581, 221–224. https://doi.org/10.1038/s41586-020-2179-y &#8617; &#8617;2 &#8617;3              Li, L., Sethi, A., Luthey-Schulten, Z. Evolution of Translation Class I Aminoacyl-tRNA Synthetase:tRNA complexes. University of Illinois at Urbana-Champaign, Luthey-Schulten Group, NIH Resource for Macromolecular Modeling and Bioinformatics, Computational Biophysics Workshop. https://www.ks.uiuc.edu/Training/Tutorials/TCBG-copy/evolution/evolution_tutorial.pdf &#8617;      "
     
   } ,
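For readers who want to experiment before the ProDy and Multiseq tutorials, here is a small numpy sketch of the three quantities defined above: RMSD, a contact map, and Qres. It assumes two equal-length, pre-aligned arrays of alpha-carbon coordinates, and it is an illustration of the formulas rather than the implementation used by those tools.

```python
import numpy as np

def rmsd(s, t):
    """RMSD between two pre-aligned n x 3 arrays of alpha-carbon coordinates."""
    return np.sqrt(np.mean(np.sum((s - t) ** 2, axis=1)))

def contact_map(s, threshold=20.0):
    """Binary matrix M with M[i, j] = 1 when alpha carbons i and j lie within
    the threshold distance (in angstroms)."""
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    return (d < threshold).astype(int)

def qres(s, t):
    """Q per residue for two equal-length structures, following the formula above."""
    n = len(s)
    ds = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    dt = np.linalg.norm(t[:, None, :] - t[None, :, :], axis=2)
    q = np.zeros(n)
    for i in range(n):
        k = 2 if i in (0, n - 1) else 3       # fewer excluded neighbors at the ends
        total = 0.0
        for j in range(n):
            if j in (i - 1, i, i + 1):        # skip i and its sequence neighbors
                continue
            sigma_sq = abs(i - j) ** 0.15     # variance term from the definition
            total += np.exp(-((ds[i, j] - dt[i, j]) ** 2) / (2 * sigma_sq))
        q[i] = total / (n - k)
    return q

# Toy usage: a random 10-residue "structure" and a slightly perturbed copy.
rng = np.random.default_rng(0)
s = rng.normal(size=(10, 3)) * 10
t = s + rng.normal(scale=0.5, size=(10, 3))
print("RMSD:", rmsd(s, t))
print("Qres:", qres(s, t).round(2))           # values near 1 = locally similar
```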
  
   {
     
        "title"    : "The Negative Autoregulation Motif",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/nar",
        "date"     : "",
        "content"  : "Simulating transcriptional regulation with a reaction-diffusion modelTheodosius Dobzhansky famously wrote that “nothing in biology makes sense except in the light of evolution.”1 In the spirit of this quotation, some evolutionary reason must explain the presence of the large number of negatively autoregulating E. coli transcription factors that we identified in the previous lesson. Our goal is to use biological modeling to establish this justification.We will simulate a “race” to the steady state, or equilibrium, concentration of a transcription factor Y in two cells, shown in the figure below. In the first cell, a transcription factor X activates Y; in the second cell, Y also negatively autoregulates. Our premise is that the cell that reaches the steady state faster can respond more quickly to its environment and is therefore more fit for survival.The two cells that we wish to simulate. In the first cell (left), X only activates Y; in the second cell (right), Y also negatively autoregulates.We will simulate these two cells using a reaction-diffusion model analogous to the one introduced in the prologue. For our new model, the “particles” represent the two transcription factors X and Y.We begin with the first cell. To simulate X activating Y, we use the reaction X  X + Y. In a given interval of time, there is a constant underlying probability related to the reaction rate that any given X particle will spontaneously form a new Y particle.We should also account for the fact that proteins are degraded over time by enzymes called proteases. The typical protein’s concentration will be degraded by 20 to 40 percent in an hour, but transcription factors degrade even faster, with each one lasting only a matter of minutes.2 Protein degradation is an important feature  of cellular design, as it allows the cell to remove a protein after increasing that protein’s concentration in response to some environmental change.To model the degradation of Y, we add a “kill” reaction that removes Y particles at some rate. We will initialize our simulation with no Y particles and the X concentration at steady state; since X is being produced at a rate that exactly balances its degradation rate, we will not need to add reactions to the model simulating the production or degradation of X.To complete the model of the first cell, diffusion of the X and Y particles is not technically necessary because no reaction in our model requires the collision of two or more particles. However, for the sake of biological correctness, we will allow both X and Y particles to diffuse through the system at the same rate.STOP: What chemical reaction could be used to simulate the negative autoregulation of Y?The model of the second cell will inherit all of the reactions from the first cell (with the same reaction rates) while adding negative autoregulation of Y, which we will model using the reaction 2Y  Y. That is, when two Y particles collide, there is some probability related to the reaction rate that one of the particles serves to remove the other, which mimics the process of a transcription factor turning off another copy of itself during negative autoregulation.To recap, both simulated cells will include an initial concentration of X at steady state, diffusion of X and Y, removal of Y, and the reaction X  X + Y. The second simulation, which includes negative autoregulation of Y, will add the reaction 2Y  Y. 
You can explore these simulations in the following tutorial.Visit tutorialNote: Although we are using a particle-based model to mimic regulation, it does not implement specific chemical reactions. In reality, gene regulation is a complicated chemical process that involves a great deal of molecular machinery. The purpose of the model is to strip away the irrelevant and retain the essence of what is being modeled.Ensuring a mathematically controlled comparisonIf you followed the above tutorial, then you were likely disappointed in the second cell and its negative autoregulating transcription factor Y. shows a plot over time of the concentration of Y particles in the two simulated cells, using yellow for the first cell and blue for the second cell.A plot of the concentration of Y particles over time across two simulations. In the first cell (red), we only have activation of Y by X, whereas in the second cell (yellow), we keep all parameters fixed but add a reaction simulating the negative autoregulation of Y.By allowing Y to slow its own transcription, we wound up with a simulation in which the final concentration of Y was lower. How could negative autoregulation possibly be useful?The solution to this quandary is that the model we built was not a fair comparison between the two systems. In particular, the two simulations converge to approximately the same steady state concentration of Y, since achieving this concentration represents the cell’s response to some stimulus. Ensuring this equal footing for the two simulations is called a mathematically controlled comparison.3STOP: How can we change the parameters of our models to obtain a mathematically controlled comparison of the two simulated cells?We should keep a number of parameters constant  across the two simulations because they are unrelated to regulation: the diffusion rates of X and Y, the number of initial particles X and Y, and the degradation rate of Y. With these parameters fixed, the only way that the final steady state concentration of Y can be the same in the two simulations is if we increase the rate at which the reaction X  X + Y takes place in the simulation of the second cell. The following tutorial adjusts this rate parameter to ensure a mathematically controlled comparison.Visit tutorialAn evolutionary basis for negative autoregulationThe figure below plots the concentration over time of Y particles for the two simulated cells after ensuring a mathematically controlled comparison, in which the rate of the reaction X  X + Y has been increased in the second cell. This figure shows that the two simulated cells now have approximately the same steady state concentration of Y. However, the second cell reaches this concentration faster; that is, its response time to the external stimulus that caused an increase in the production of Y is shorter.A comparison of the concentration of Y particles across the same two simulations from the previous figure. This time, in the second simulation (yellow), we increase the rate of the reaction X  X + Y.  As a result, the two simulations have approximately the same steady state concentration of Y, and the simulation that includes negative autoregulation reaches steady state more quickly.The above plots also provide evidence of why negative autoregulation may have evolved. 
The simulated cell that includes negative autoregulation wins the “race” to a steady state concentration of Y, and so we can conclude that this cell is more fit for survival than one in which Y does not negatively autoregulate. Uri Alon4 has proposed an excellent analogy of a negatively autoregulating transcription factor as a sports car that has both a powerful engine (corresponding to the higher rate of the reaction producing Y) and sensitive brakes (corresponding to negative autoregulation slowing the production of Y to reach equilibrium quickly).In this lesson, we have seen that particle-based simulations can be powerful for justifying why a network motif is prevalent. What are some other commonly occurring network motifs in transcription factor networks? And what evolutionary purposes might they serve?Next lesson            Dobzhansky, Theodosius (March 1973), “Nothing in Biology Makes Sense Except in the Light of Evolution”, American Biology Teacher, 35 (3): 125–129, JSTOR 4444260) &#8617;              Goodsell, David (2009), The Machinery of Life. Copernicus Books. &#8617;              Savageau, M (1976). Biochemical systems analysis: A study of function and design in molecular biology. Addison Wesley. &#8617;              Alon, Uri. An Introduction to Systems Biology: Design Principles of Biological Circuits, 2nd Edition. Chapman &amp; Hall/CRC Mathematical and Computational Biology Series. 2019. &#8617;      "
     
   } ,
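A deterministic sketch captures the same race without particle simulation. Below, both cells are reduced to one-variable differential equations: production minus degradation for the first cell, with an extra −γY² removal term standing in for the 2Y → Y reaction in the second. All rate constants are invented, and the boosted production rate is chosen so that both cells share the same steady state, mirroring the mathematically controlled comparison.

```python
# Minimal ODE stand-in for the two simulated cells (all rates are invented).
alpha, beta, gamma = 1.0, 10.0, 0.5        # degradation, production, NAR strength
y_st = beta / alpha                        # steady state shared by both cells
beta_nar = alpha * y_st + gamma * y_st**2  # boosted production for a fair comparison

dt, steps = 0.001, 10000
y_plain = y_nar = 0.0
t_plain = t_nar = None
for step in range(1, steps + 1):
    y_plain += dt * (beta - alpha * y_plain)
    y_nar += dt * (beta_nar - alpha * y_nar - gamma * y_nar**2)
    if t_plain is None and y_plain >= 0.5 * y_st:
        t_plain = step * dt
    if t_nar is None and y_nar >= 0.5 * y_st:
        t_nar = step * dt

# The negatively autoregulated cell reaches half of the steady state sooner.
print(f"time to half steady state: plain {t_plain:.3f}, NAR {t_nar:.3f}")
```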
  
   {
     
        "title"    : "Transcription Factor Networks",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/networks",
        "date"     : "",
        "content"  : "The transcription factor network of E. coliOnce we know which genes each transcription factor regulates, we can consolidate this information into a transcription factor network. The nodes in the network represent an organism’s proteins, and an edge connects X to Y if X is a transcription factor that regulates the expression of protein Y. These edges are one-way connections; any node can have an edge leading into it, but only a transcription factor can have an edge leaving it.The figure below shows a portion of the transcription factor network for Escherichia coli, the workhorse model organism of bacterial study. The complete network, which is the sum of over two decades of biological research, consists of thousands of genes and around 300 transcription factors1. Because of the size of this network, it forms what computational biologists affectionally call a “hairball”, or a network with so many connections that it is functionally impossible to analyze visually. For this reason, we will need to use computational approaches to study this network.Note that the edges in the E. coli transcription factor network below have different colors. An edge connecting X to Y is colored blue if X activates Y, and it is colored orange if X represses Y. (Alternatively, we could label the edges with a “+” or “-“.)A subset of the E. coli transcription factor network2 (click to enlarge). An edge from X to Y denotes that X is a transcription factor that regulates Y. Edges corresponding to activation are colored blue, and edges corresponding to repression are colored orange.STOP: Select the expanded view of the transcription factor network in the figure above. Do you notice anything interesting about this network?Loops in the transcription factor networkYou may have noticed that the E. coli transcription factor network has surprisingly many loops, or edges that connect a node to itself. We will pause to consider the implications of a loop in a transcription factor network  what does it mean for a transcription factor to regulate itself?A transcription factor is a protein, which means that because of the central dogma, the transcription factor is produced as the result of transcription and translation of a gene appearing in an organism’s DNA. In autoregulation, illustrated in the figure below, the transcription factor protein then binds to the DNA in the region preceding the gene that encodes the very same transcription factor. This type of feedback is a beautiful and surprising feature of a simple biological system.A simplified illustration of autoregulation, in which a gene is transcribed into messenger RNA (mRNA) and then translated into a transcription factor protein, and then this transcription factor regulates the same gene, producing a feedback loop. “Protein” labels the transcription factor binding factor protein, which binds to the DNA encoding this transcription factor, labeled by “Gene”.Transcription factor autoregulation leads us to ask two questions. First, how can we justify that a transcription factor network has “surprisingly many” loops? And second, if autoregulation is so common, then why would a transcription factor have evolved to regulate its own transcription? We will address these questions in each of the next two lessons.Next lesson            Gene ontology database with “transcription” keyword: https://www.uniprot.org/. &#8617;              Samal, A. &amp; Jain, S. The regulatory network of E. 
coli metabolism as a Boolean dynamical system exhibits both homeostasis and flexibility of response. BMC Systems Biology,  2, 21 (2008). https://doi.org/10.1186/1752-0509-2-21 &#8617;      "
     
   } ,
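Finding loops computationally is as simple as scanning an edge list for edges whose two endpoints coincide. The network below is a made-up toy, not the real E. coli network, and the signs follow the “+”/“-” labeling mentioned above.

```python
# A toy signed, directed transcription factor network (edges are invented).
# (x, y, sign): transcription factor x regulates gene y; '+' activates, '-' represses.
edges = [
    ("tfA", "geneB", "+"),
    ("tfA", "tfA", "-"),     # a loop: tfA negatively autoregulates
    ("tfC", "tfA", "-"),
    ("tfC", "tfC", "-"),     # another negative autoregulation loop
    ("tfA", "geneD", "+"),
]

loops = [(x, sign) for x, y, sign in edges if x == y]
print("autoregulation loops:", loops)
print("negative autoregulation count:", sum(sign == "-" for _, sign in loops))
```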
  
   {
     
        "title"    : "Biological Oscillators",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/oscillators",
        "date"     : "",
        "content"  : "Oscillators are everywhere in natureEven if placed in a windowless bunker without clocks, humans will maintain a roughly 24-hour cycle of sleep and wakefulness1. This circadian rhythm is present throughout many living things, including plants and even cyanobacteria2. Your heart and respiratory system also follow subconscious cyclical rhythms, and your cells are governed by a strict cell cycle as they grow and divide.We might guess from what we have learned in this module that cyclical biological rhythms must be governed by simple rules. However, the question remains as to what these rules are and how they can correctly execute oscillations over and over.Researchers have identified many network motifs that facilitate oscillation, some of which are very complicated and include many components. In this lesson, we will focus on a simple three-component oscillator motif.The repressilator: a synthetic biological oscillatorThe repressilator motif3 is shown in the figure below. In this motif, all three proteins are transcription factors, and they form a cycle in which X represses Y, Y represses Z, and Z represses X. The repressilator forms a feedback loop, but nothing a priori about this motif indicates that it would lead to oscillation.The repressilator motif for three particles X, Y, and Z. X represses Y, which represses Z, which in turn represses X, forming a feedback loop.STOP: Devise a reaction-diffusion model representing the repressilator.To build a reaction-diffusion model accompanying the repressilator, we start with a quantity of X particles and no Y or Z particles. We assume that all three particles diffuse at the same rate and degrade at the same rate.Furthermore, we assume that all three particles are produced as the result of an activation process by some other transcription factor(s), which we assume happens at the same rate. We will use a single hidden particle I that serves to activate the three visible particles via the three reactions I  I + X, I  I + Y, and I  I + Z, all taking place at the same rate.In the previous lesson on the feed-forward loop, we saw that we can use the reaction X + Y  X to model the repression of Y by X. To complete the repressilator model, we will supplement this reaction with the reactions Y + Z  Y and Z + X  Z, with all three repression reactions occurring at the same rate.Note: To better reflect the biological reality of proteins regulating genes, we will need to add a few additional particles to our model differentiating transcription factor genes from the proteins that they encode. We leave the details to the tutorial below.Visit tutorialInterpreting the repressilator’s oscillationsThe figure below plots the concentration over time of X, Y, and Z particles from our repressilator simulation (colored yellow, red, and blue, respectively). The system exhibits clear oscillatory behavior, with X, Y, and Z taking turns being at high concentration.STOP: Why do you think that the repressilator motif leads to a pattern of oscillations?Modeling the repressilator’s concentration of each particle over time; X is shown in yellow, Y is shown in red, and Z is shown in blue.Because the concentration of X starts out high, with no Y or Z present, the concentration of X briefly increases because its rate of production exceeds its rate of degradation. With no Y or Z particles present, there are none to degrade or be repressed, and so the concentrations of these particles start increasing as well. 
However, because X particles begin at high concentration, the repression reaction X + Y  X prevents the concentration of Y from growing as fast as the concentration of Z.As the concentration of Z rises, the repression reaction Z + X  Z occurs often enough for the rate of removal of X to equal and exceed its rate of production, accounting for the first (yellow) peak in the figure above. The concentration of X then plummets, with the concentration of Z rising up to replace it. This situation is shown by the second (blue) peak in the figure above.As a result, Z and X have switched roles. Because the concentration of Z is high and the concentration of Y is increasing, the reaction Y + Z  Y will occur frequently and reduce the concentration of Z. Furthermore, because the concentration of X has decreased and the concentration of Y is still relatively low, the reaction X + Y  X will occur less often, allowing the concentration of Y to continue to rise. Eventually, the decrease in the concentration of Z and the increase in the concentration of Y will account for the third (red) peak in the figure above.At this point, the reaction X + Y  X will suppress the concentration of Y. Because the concentration of X and Z are both lower than the concentration of Z, the reaction Z + X  Z will not greatly influence the concentration of X, which will rise to meet the falling concentration of Y, and we have returned to our original situation, at which point the cycle will begin again.Noise is a feature, not a bugTake another look at the figure showing the oscillations of the repressilator. You will notice that the concentrations zigzag as they travel up or down, and that they peak at slightly different levels each time.This noise in the repressilator’s oscillations is due to random variation as the particles bounce around due to diffusion. The repression reactions require two particles to collide, and these collisions may occur more or less often than expected because of random chance, which is exacerbated by low sample size. We have around 150 molecules at each peak in the above figure, but a given cell may have 1,000 to 10,000 molecules of a single protein.4Yet the noise in the repressilator’s oscillations is a feature, not a bug. As we have discussed previously, the cell’s molecular interactions are inherently random. If we see oscillations in a simulation built on randomness, then we can be confident that this simulation is robust to a certain amount of variation.In this module’s conclusion, we will further explore the concept of robustness as it pertains to the repressilator. What happens if our simulation experiences a much greater disturbance to the concentration of one of the particles?  Will it still be able to recover and return to the same oscillatory pattern?Next lesson            Aschoff, J. (1965). Circadian rhythms in man. Science 148, 1427–1432. &#8617;              Grobbelaar N, Huang TC, Lin HY, Chow TJ. 1986. Dinitrogen-fixing endogenous rhythm in Synechococcus RF-1. FEMS Microbiol Lett 37:173–177. doi:10.1111/j.1574-6968.1986.tb01788.x.CrossRefWeb of Science. &#8617;              Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature. 2000;403(6767):335-338. doi:10.1038/35002125 &#8617;              Brandon Ho, Anastasia Baryshnikova, Grant W. Brown. Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome. Cell Systems, 2018; DOI: 10.1016/j.cels.2017.12.004 &#8617;      "
     
   } ,
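A three-variable differential equation model offers a compact, deterministic view of the same cycle (it omits the hidden activator, the diffusion, and the noise of the particle simulation). Each protein is produced at a rate suppressed by its repressor and degrades at a constant rate; the constants below are invented and merely large enough for oscillation.

```python
# Deterministic repressilator sketch (parameters invented; no noise or diffusion).
beta, n = 10.0, 3            # production strength and repression steepness
dt, steps = 0.01, 6000
x, y, z = 1.0, 0.0, 0.0      # start with only X present, as in the text

trace = []
for _ in range(steps):
    dx = beta / (1 + z**n) - x    # X is repressed by Z
    dy = beta / (1 + x**n) - y    # Y is repressed by X
    dz = beta / (1 + y**n) - z    # Z is repressed by Y
    x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    trace.append((x, y, z))

# Print which species leads at a few times; the lead rotates X -> Z -> Y -> X.
for step in (1000, 2000, 3000, 4000, 5000):
    values = dict(zip("XYZ", trace[step - 1]))
    print(f"t = {step * dt:5.1f}  leader: {max(values, key=values.get)}")
```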
  
   {
     
        "title"    : "Principal Components Analysis",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/pca",
        "date"     : "",
        "content"  : "The curse of dimensionalityThings get weird in multi-dimensional space.Consider a circle inscribed in a square, as shown in the figure below. The ratio of the area of the circle to the area of the square is π/4  0.785, regardless of the square’s side length. When we move to three dimensions and have a sphere inscribed in a cube, the ratio of the volume of the sphere to the volume of the cube is (4π/3)/8  0.524.A circle inscribed in a square takes up more of the square (78.5 percent) than a sphere inscribed in a cube (52.4 percent).We define an n-dimensional unit sphere as the set of points in n-dimensional space whose Euclidean distance from the origin is at most 1, and an n-dimensional cube as the set of points whose coordinates are all between 0 and 1. A precise definition of the volume of a multi-dimensional object is beyond the scope of our work, but as n increases, the sphere takes up less and less of the cube. As n tends toward infinity, the ratio of the volume of the n-dimensional unit sphere to the volume of the n-dimensional unit cube approaches zero!One way of interpreting the vanishing of the sphere’s volume is that as n increases, an n-dimensional cube has more and more corners in which points can hide from the sphere. Most of the cube’s volume therefore winds up scattering outward from its center.The case of the vanishing sphere may seem like an arcane triviality that holds interest only for mathematicians toiling in fluorescently lit academic offices at strange hours. Yet this phenomenon is just one manifestation of a profound paradigm in data science called the curse of dimensionality, which is a collection of principles that arise in higher dimensions that run counter to our intuition about three-dimensional space.How the curse of dimensionality affects classificationIn the previous lesson, we discussed sampling n points from the boundary of a two-dimensional WBC nuclear image, thus converting the image into a vector in a space with 2n dimensions. We argued that n needs to be sufficiently large to ensure that comparing the vectors of two images will give an accurate representation of how similar their shapes are. Yet increasing n means that we need to be careful about the curse of dimensionality.Say that we sample k points randomly from the interior of an n-dimensional cube. Let dmin and dmax denote the minimum and maximum distance from any of our points to the origin, respectively. As n grows, the ratio dmin/dmax heads toward 1; in other words, the minimum distance between points becomes indistinguishable from the maximum distance between points.This other facet of the curse of dimensionality means that algorithms like k-NN, which classify points with unknown classes based on nearby points with known classes, may not perform well in higher-dimensional spaces in which even similar points tend to fly away from each other.Because of the curse of dimensionality, it makes sense to reduce the number of dimensions before performing any further analysis such as classification. One way to reduce the number of dimensions would be to reduce the number of features used for generating a vector, especially if we have reason to believe that some features are more informative than others. 
This approach will likely not work for our WBC image example, since it is not clear why one point on the boundary of our images would be inherently better than another, and we already know about the dangers of undersampling points.Instead, we will reduce the number of dimensions of our shape space without removing any features from the data. The concept of reducing the dimension of a space may be non-intuitive, and so we will explain dimension reduction in the context of two- and three-dimensional space; our approach may be more familiar than you think.Dimension reduction with principal components analysisWe will introduce dimension reduction using the iris flower dataset that we introduced when discussing classification. Although this dataset has four features, we will focus again on only petal length and width, which we plot against each other in the figure below. We can trust our eyes to notice the clear pattern: as iris petal width increases, petal length tends to increase as well.Petal length (x-axis) plotted against petal width (y-axis) for all flowers in the iris flower dataset.A line drawn through the center of the data (see figure below) provides a reasonable estimate of a flower’s petal width given its length, and vice-versa. This line, a one-dimensional object, therefore approximates a collection of points in two dimensions.A line passing through the plot of iris petal length against petal width. The line tells us approximately how long we can expect an iris petal to be given the petal’s width, and vice-versa.STOP: How could we have determined the line in the figure above?Long ago in math class, you may have learned how to choose a line to best approximate a two-dimensional dataset using linear regression, which we will now briefly describe. In linear regression, we first establish one variable as the dependent variable, which is typically assigned to the y-axis. In our iris flower example, the dependent variable is petal width.Given a line, we use L(x) to denote the y-coordinate of the point on the line corresponding to a given x-coordinate. For this line, we can then define the residual of a data point (x, y) as the difference y - L(x) between its y-coordinate and the y-coordinate on the line corresponding to x. If a residual is positive, then the data point lies “above” the line, and if the residual is negative, then the point lies “below” the line (see figure below).An example line and data points with a visual depiction of the points’ residuals (dashed lines), which visualize the differences in actual y-values and those estimated by the line. The absolute value of a residual is the length of its dashed line, and the sign of a residual corresponds to whether it lies above or below the line.As the line changes, so will the points’ residuals. The smaller the residuals become, the better the line fits the points. In linear regression, we are looking for the line that minimizes the sum of squared residuals.Linear regression is not the only way to fit a line to a collection of data. 
Choosing petal width as the dependent variable makes sense if we want to explain petal width as a function of petal length, but if we were to make petal length the dependent variable instead, then linear regression would minimize the sum of squared residuals in the x-direction, as illustrated in the figure below.If x is the dependent variable, then the residuals with respect to a line become the horizontal distances between points and the line, and linear regression finds the line that minimizes the sum of the squares of these horizontal residuals over all possible lines through the data.Note: The linear regression line will likely differ according to which variable we choose as the dependent variable, since the quantity that we are minimizing changes. However, if a linear pattern is present in our data, then the two regression lines will be similar.STOP: For the iris flower dataset, which of the two choices for dependent variable is better?The preceding question implies that no clear causality underlies the correlation between petal width and petal length, which makes it difficult to prioritize one variable over the other as the dependent variable. For this reason, we will revisit how we are defining the line that best fits the data.Instead of considering residuals based on distances to the line in only the x-direction or the y-direction, we can treat both variables equally. To do so, we examine the distance from each data point to its nearest point on the line (see figure below), which is called the projection of the point onto the line. The line that minimizes the sum of the squared distances between each point and its projection onto the line is called the first principal component of the data.A line along with a collection of points; dashed lines show the shortest segments connecting each data point to its projection onto the line, which is the point on the line that is closest to the data point.The first principal component is often said to be the line that “explains the most variance in the data”. If a correspondence exists between iris petal width and length, then the distances from points to the first principal component correspond to variation due to randomness. By minimizing the sum of squares of these distances, we limit the amount of variation in our data that we cannot explain with a linear relationship.The following animated GIF shows a line rotating through a collection of data points, with the distance from each point to the line shown in red. As the line rotates, we can see the distances from the points to the line change.An animated GIF showing that the distances from points to their projections onto a line change as the line rotates. The line of best fit is the one in which the sum of the squares of these distances is minimized.  Source: amoeba, StackExchange user.1Another benefit of finding the first principal component of a dataset is that it allows us to reduce the dimensionality of our dataset from two dimensions to one. In the figure above, the projection of each point onto the line is shown in red. 
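The first principal component is also straightforward to compute numerically. The sketch below (Python with NumPy; the toy points are illustrative, not the real iris measurements) centers the data, extracts the top singular vector, and projects each point onto the resulting line.

```python
import numpy as np

# Made-up two-dimensional data points (petal length, petal width).
points = np.array([[1.4, 0.2], [1.7, 0.4], [3.9, 1.2],
                   [4.5, 1.5], [5.1, 1.9], [5.9, 2.1]])

# Center the data; the top right singular vector then spans the line
# minimizing the sum of squared perpendicular distances (the first PC).
mean = points.mean(axis=0)
centered = points - mean
_, _, vt = np.linalg.svd(centered)
direction = vt[0]                      # unit vector along the first PC

# One-dimensional coordinates: the projection of each point onto the line.
coords_1d = centered @ direction
projections = mean + np.outer(coords_1d, direction)
print(coords_1d)
```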
The projections of a collection of data points onto their first principal component give a one-dimensional representation of the data.Say that we wanted to generalize these ideas to three-dimensional space. The first principal component would offer a one-dimensional explanation of the variance in the data, but perhaps a line is insufficient to this end. The points could lie very near to a plane (a two-dimensional object), and projecting these points onto this plane would effectively reduce the dataset to two dimensions, as shown in the figure below.(Top) A collection of seven points, each labeled with a different color. Each point is projected onto the plane that minimizes the sum of squared distances between points and the plane. The line indicated is the first principal component of the data; this line lies within the plane, which is the case for any dataset. (Bottom) A reorientation of the plane such that the first principal component is shown as the x-axis, with colored points corresponding to the projections onto the plane from the top figure. The y-axis of this plane is known as the “second principal component” of the data.Our three-dimensional minds will not permit us the intuition needed to visualize the extension of this idea into higher dimensions, but we can generalize these concepts mathematically. Given a collection of m data points (or vectors) in n-dimensional space, we are looking for a d-dimensional hyperplane, or an embedding of d-dimensional space inside n-dimensional space, such that the sum of squared distances from the points to the hyperplane is minimized. By taking the projections of points to their nearest point on this hyperplane, we reduce the dimension of the dataset from n to d.This approach, which is over 100 years old but omnipresent in modern data science, is called principal component analysis (PCA). A closely related concept called singular value decomposition was developed in the 19th century.Note: It can be proven that for any dataset, when d1 is smaller than d2, the hyperplane provided by PCA of dimension d1 is a subset of the hyperplane of dimension d2. For example, the first principal component is always found within the plane (d = 2) provided by PCA, as indicated in the preceding figure.We will soon apply PCA to reduce the dimensions of our shape space; first, we make a brief aside to discuss a different biological problem in which the application of PCA has provided amazing insights.Genotyping: PCA is more powerful than you thinkBiologists have identified hundreds of thousands of markers, locations within human DNA that are common sources of human variation. The most common type of marker is a single nucleotide (A, C, G, or T). In the process of genotyping, a service provided by companies as part of the booming ancestry industry, an individual’s markers are determined from a DNA sample.An individual’s n markers can be converted to an n-dimensional vector v = (v1, v2, …, vn) such that vi is 1 if the individual possesses the variant at the i-th marker and vi is 0 if the individual has the more common version of that marker.Note: The mathematically astute reader will notice that this vector lies on one of the many corners of an n-dimensional hypercube.Because n is so large (and in the early days of genotyping studies it far outnumbered the number of individual samples), we need to be wary of the curse of dimensionality. 
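Projecting such 0/1 marker vectors is a one-liner with standard tools. The toy sketch below (assuming scikit-learn and NumPy are installed; the marker matrix is random, so no geographic structure will appear here) mirrors the computation described next.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy genotype matrix: 100 individuals by 5,000 markers, where entry (i, j)
# is 1 if individual i carries the variant at marker j and 0 otherwise.
markers = (rng.random((100, 5000)) < 0.1).astype(int)

# Project the 5,000-dimensional marker vectors onto the PCA plane (d = 2).
projected = PCA(n_components=2).fit_transform(markers)
print(projected.shape)   # (100, 2): one point in the plane per individual
```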
When we apply PCA with d = 2 to produce a lower-dimensional projection of the data, we will see some amazing results that helped launch a multi-billion-dollar industry.The figure below shows a two-dimensional projection for individuals of known European ancestry. Even though we have condensed hundreds of thousands of dimensions to just two, and even though we are not capturing any information about the ancestry of the individuals other than their DNA, the projected data points reconstruct the map of Europe.The projection of a collection of marker vectors sampled from individuals of known European ethnicity onto the plane produced by PCA with d = 2. Individuals cluster by country, and neighboring European countries remain nearby in the projected dataset.2If we zoom in on Switzerland, we can see that the countries around Switzerland tend to pull individuals toward them based on language spoken (see figure below).A PCA plot (d = 2) of individuals from Switzerland as well as neighboring countries shows that an individual’s mother tongue correlates with the individual’s genetic similarity to representatives from the neighboring country where that language is spoken.2And if we zoom farther out, then we can see continental patterns emerge, with India standing out as its own entity. What is particularly remarkable about all these figures is that humans on the whole are genetically very similar, and yet PCA is able to find evidence of human migrations and separation lurking within our DNA.A PCA plot (d = 2) shows clustering of individuals from Europe, Asia, Africa, and India.3Now that we have established the power of PCA to help us see patterns in high-dimensional biological data, we are ready to use CellOrganizer to build a shape space for our WBC images and apply PCA to this shape space to produce a lower-dimensional representation of the space that we can visualize.Note: Before visiting this tutorial, we should point out that CellOrganizer is a much more flexible and powerful software resource than what is shown in the confines of this tutorial. For example, CellOrganizer not only infers properties from cellular images but can also build generative models that form simulated cells in order to infer cellular properties. For more on what CellOrganizer can do, consult the publications page at its homepage.Visit tutorialVisualizing the WBC shape space after PCAThe figure below shows the shape space of WBC images, reduced to three dimensions by PCA, in which each image is represented by a point that is color-coded according to its cell family.The projection of each WBC shape vector onto a three-dimensional PCA hyperplane produces the above three-dimensional space. Granulocytes are shown in blue, lymphocytes are shown in orange, and monocytes are shown in green.We can also subdivide granulocytes into basophils, eosinophils, and neutrophils. Updating our labels according to this subdivision produces the following figure.The reduced dimension shape space from the previous figure, with granulocytes subdivided into basophils, eosinophils, and neutrophils.Although images from the same family do not cluster as tightly as the flowers in the iris dataset (which could be criticized as an unrealistically clean representation of real datasets), images from the same family do appear to be near each other. This fact should give us hope that proximity in a shape space of lower dimension may help us correctly classify images of unknown family.Next lesson            Amoeba, Stack Exchange user. 
Making sense of principal component analysis, eigenvectors &amp; eigenvalues, Stack exchange URL (version: 2021-08-05): https://stats.stackexchange.com/q/140579 &#8617;              Novembre J et al (2008) Genes mirror geography within Europe. Nature 456:98–101. Available online &#8617; &#8617;2              Xing J et al (2009) Fine-scaled human genetic structure revealed by SNP microarrays. Genome Research 19(5): 815-825. Available online &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Random Walks Model Diffusion",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/random_walk",
        "date"     : "",
        "content"  : "The wanderlust of a randomly walking particleYou may feel like a single, coherent being, but you are just a skin-covered bag of trillions of cells that act largely independently. Cells are full of proteins, complex macromolecules that perform nearly every cellular function. If a protein could move in a straight line, then it would move at 20 kph or faster1, meaning that the protein would cover a distance 1 billion times its length every second (analogous to a car traveling at 20 billion kph). However, the cytoplasm filling the cell is so densely packed with water molecules that the protein ping-pongs off them, frequently changing direction.We will model the movements of a cellular particle such as a protein by a random walk in a two-dimensional plane. At each step, the particle moves a single unit of distance in a randomly chosen direction. The video below shows a randomly walking particle taking 1000 steps.      The distance that the particle wanders from its starting point may surprise you. And yet perhaps this particle is just an outlier, and the typical particle would be much more of a homebody.From one particle to manyIf we animate the action of many independent particles following random walks, then although some particles hug the starting point and some wind up far away, most particles steadily drift outward. The following video shows a simulation of multiple randomly walking particles that all begin their walk at the same point and that diffuse throughout their environment.      Although the movements of a single particle are random, we can draw conclusions about the average-case behavior of many particles can be predicted, as the following theorem indicates.Random Walk Theorem: After n steps of unit length in a random walk, a particle will on average find itself a distance from its origin that is proportional to \(\sqrt{n}\).Note: If you love mathematics and are interested in seeing a proof of this theorem, click here.Our experience of the world confirms the Random Walk Theorem’s statement that randomly walking particles tend to drift away from their starting point. We understand, for example, that someone who has a respiratory virus can infect many others within an enclosed space as the viral particles expand outward. We also know that when a cake is baking in the oven, we will not need to wait long for delicious smells to waft from the kitchen.If you are interested in seeing how to build the random walk simulation shown in the video above, then please visit the following software tutorial. This tutorial uses CellBlender, an add-on to the popular open graphics software program Blender, which allows us to create and visualize biological models. These models rely on particle-based reaction-diffusion simulations that are implemented by the program MCell. We will use this software for our work in biological modeling in this prologue as well as module 1.Note: We have designed this course so that you can appreciate the key ideas behind the biological models that we build without following software tutorials. But we also provide these tutorials so that you can explore the modeling software that we have used to generate our conclusions.Visit tutorialBrownian motion: big numbers in small spacesLater in this course, we will see that random walks power a simple but powerful approach that bacteria like E. coli use to explore their environment in the hunt for food. 
In the next lesson, we will see that randomly moving particles can produce high-level patterns if the particles interact when they collide.Before continuing, we will point you to a beautiful animation illustrating just how far a single randomly moving particle can travel in a relatively small amount of time. This animation, which shows a simulation of the path taken by a glucose molecule as the result of Brownian motion, starts at 6:10 of the following excellent video developed by the late Joel Stiles.      Next lesson            Goodsell, David (2009), The Machinery of Life. Copernicus Books. &#8617;      "
     
   } ,
  
   {
     
        "title"    : "A Reaction-Diffusion Model Generating Turing Patterns",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/reaction-diffusion",
        "date"     : "",
        "content"  : "From random walks to reaction-diffusionIn the previous lesson, we introduced the model of a particle randomly walking through a medium. But what exactly do random walks have to do with Alan Turing and zebras?Turing’s insight was that remarkable high-level patterns could arise if we combine particle diffusion with chemical reactions, in which colliding particles interact with each other. Such a model is called a reaction-diffusion system, and the emergent patterns are called Turing patterns in his honor.We will consider a reaction-diffusion system having two types of particles, A and B. The system is not explicitly a predator-prey relationship, but you may like to think of the A particles as prey and the B particles as predators.Both types of particles diffuse randomly through the plane, but the A particles diffuse more quickly than the B particles.  In the simulation that follows, we will assume that A particles diffuse twice as quickly as B particles. In terms of our random walk model, this faster rate of diffusion means that in a single “step”, an A particle moves twice as far as a B particle.STOP: Say that we release a collection of A and B particles at the same location. If the particles move via random walks, and the A particles diffuse twice as fast as the B particles, then on average, how much farther from the origin will the A particles be than the B particles after n steps?We now will add three reactions to our system. First, A particles are added into the system at some constant feed rate f. As a result of the feed reaction, the concentration of the A particles increases by a constant number in each time step.Note: We will work with a two-dimensional simulation, but in a three-dimensional system, the units of f would be in mol/L/s, which means that every second, there are f moles of particles added to the system for every liter of volume. (Recall from your chemistry class long ago that one mole is 6.02214076 · 1023 particles, called Avogadro’s number.)Second, B particles are removed from the system at a constant kill rate k. As a result of the kill reaction, the number of B particles in the system decreases by a factor of k in a given time step. That is, the more B particles that are present, the more B particles are removed.Third, our reaction-diffusion system includes the following reaction involving both particle types. The particles on the left side of this reaction are called reactants and the particles on the right side are called products:A + 2B  3B.To simulate this reaction on a particle level, if an A particle and two B particles collide with each other, then the A particle has some fixed probability of being replaced by a third B particle, which could vary in practice based on the presence of a catalyst and the orientation of the particles when they collide. This probability directly relates to the rate at which the reaction occurs, denoted r. This third reaction is why we compared A to prey and B to predators, since we may like to conceptualize the reaction as two B particles consuming an A particle and producing a new offspring B particle.The three reactions defining our system are summarized by the figure below.                                                                        A visualization of our reaction-diffusion system at a moment in time showing updates due to reactions. (Left) The system contains both types of particles and two collisions. The two A particles shown with dashed lines are not yet present. 
(Right) Two of the A particles are fed into the system, two of the B particles die out, and a B particle replaces an A particle after the collision of two B particles and an A particle.  Before continuing, we call your attention to a slight difference between the feed and kill reactions. In the feed reaction, the concentration of A particles increases by a constant quantity in each time step. In the kill reaction, the concentration of B particles decreases by a constant factor multiplied by the current concentration of B particles. If we were using calculus to model this system, then letting [A] and [B] denote the concentrations of the two particle types, we would represent the feed and kill reactions by the following two differential equations:d[A]/dt = f; d[B]/dt = -k · [B].Parameters are omnipresent in biological modelingA parameter is a variable quantity used as input to a model. Parameters are inevitable in biological modeling (and data science in general), and as we will see, changing parameters can cause major changes in the behavior of a system.Four parameters are relevant to our reaction-diffusion system: the feed rate (f) of A particles, the kill rate (k) of B particles, the predator-prey reaction rate (r), and the diffusion rate (i.e., speed) of B particles. We do not need to set the diffusion rate of A particles, since we know that their diffusion is twice that of B particles.We think of all these parameters as dials that we can turn, observing how the system changes as a result. For example, if we raise the diffusion rate, then the particles will be moving around and colliding with each other more, which means that we will see more of the reaction A + 2B → 3B.STOP: What will happen as we increase or decrease f or k?In the following tutorial, we will initiate our reaction-diffusion system with a uniform concentration of A particles spread across a two-dimensional plane and a tightly packed collection of B particles in the center of the plane. When we return from this tutorial, we will see if any high-level patterns form.Visit tutorialChanging reaction-diffusion parameters produces different emergent Turing patternsFor many parameter values, our reaction-diffusion system is not very interesting. The following animation is produced when using parameter rates in CellBlender of f = 1000 and k = 500,000. It shows that if the kill rate is too high, then B particles (colored red) will die out more quickly than they can be replenished by the reaction with A particles (colored green), and so only A particles will remain. Conversely, if f is too high, then the increased feed rate will cause an increase in the concentration of A particles. However, the increased concentration of A particles will also lead to more collisions between A particles and pairs of B particles, causing the concentration of B particles to explode. The following simulation has the parameters f = 1,000,000 and k = 100,000. The interesting behavior in this system lies in a sweet spot of f and k parameter values. For example, consider the following visualization when f is equal to 100,000 and k is equal to 200,000. We see a clear stripe of B particles expanding outward against a background of A particles, with subsequent stripes appearing at locations where a critical mass of B particles can interact with each other. When we hold k fixed and increase f to 140,000, the higher feed rate increases the likelihood of B particles encountering A particles, and so we see even more stripes of B particles. 
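The next lesson develops a faster, grid-based version of this system; as a preview, here is a minimal sketch of such an update rule in Python with NumPy. This coarse-grained model is commonly called the Gray-Scott model; the dimensionless f and k values below are conventional demo values, not the CellBlender rates quoted in this lesson. Plotting B after the loop (for instance with matplotlib's imshow) reveals spot and stripe patterns for parameters in the "sweet spot."

```python
import numpy as np

def laplacian(grid):
    # Discrete Laplacian: sum of the four neighbors minus four times the
    # cell itself (periodic boundary conditions via np.roll).
    return (np.roll(grid, 1, axis=0) + np.roll(grid, -1, axis=0) +
            np.roll(grid, 1, axis=1) + np.roll(grid, -1, axis=1) - 4 * grid)

n, f, k = 128, 0.055, 0.062          # grid size, feed rate, kill rate
A = np.ones((n, n))                  # "prey" concentration, uniform at first
B = np.zeros((n, n))
B[54:74, 54:74] = 1.0                # tightly packed B particles in the center

for _ in range(10000):
    reaction = A * B * B             # rate of the reaction A + 2B -> 3B
    A += 0.16 * laplacian(A) - reaction + f * (1 - A)  # A diffuses twice...
    B += 0.08 * laplacian(B) + reaction - (k + f) * B  # ...as fast as B
```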
As f approaches k, the stripe structure becomes chaotic and breaks down because many clusters of B particles are now colliding and mixing. The following animation shows the result of raising f to 175,000. Once f is equal to k, the stripes disappear, as shown in the video below. We might expect this to mean that the A and B particles will be uniformly mixed. Instead, we see that after an initial outward explosion of B particles, the system displays a mottled background. Pay attention to the following video at a point late in the animation. Although the concentrations of the particles are still changing, there is much less large-scale change than in earlier videos. If we freeze the video, our eye cannot help but see patterns of red and green clusters that resemble mottling.The Turing patterns that emerged from our particle simulations are a testament to the human eye’s ability to find organization within the net behavior of tens of thousands of particles. For example, take another look at the video we produced that showed mottling in our particle simulator. Patterns in this simulation are noisy: even in the dark red regions we will have quite a few green particles, and vice-versa. The rapid inference of large-scale patterns from small-scale visual phenomena is one of the tasks that our brains have evolved to perform well. You may still be skeptical, since the patterns in the above videos do not have the concrete boundaries that we might expect of animal stripes and spots. Yet when we examine the skin of an animal exhibiting Turing patterns, we see the effect of a pointillist painting: the patterns that we infer on a higher level are just the net result of many varied individual points. The figure below shows an example of this effect for zebrafish skin.Zooming in on the striped skin of a zebrafish shows that each stripe is formed of many differently colored cells, and that the boundaries of the stripes are more variable than they may seem at lower resolution. Image courtesy: JenniferOwen, Wikimedia Commons (adapted by Kit Yates).Turing’s patterns and Klüver’s form constantsThe particle simulations in the previous subsection may evoke the adjective “trippy”. This is no accident.Research dating to the 1920s has studied the patterns that humans see during visual hallucinations, which Heinrich Klüver named form constants after studying patients who had taken mescaline.1 These patterns, such as cobwebs, tunnels, and spirals, occur across many individuals, regardless of the cause of their hallucinations.A few of Heinrich Klüver’s form constants. Image courtesy: Lisa Diez, Wikimedia Commons.Over five decades after Klüver’s work, researchers would determine that form constants originate from simpler linear stripes of cellular activation patterns in the retina. 
When the brain passes these linear patterns from the circular retina to the rectangular field of view interpreted by the visual cortex, it contorts these stripes into form constants.2 Some researchers3 believe that the patterns in the retina caused by hallucinations are in fact Turing patterns and can be explained by a reaction-diffusion model in which one type of neuron acts as a predator and another acts as prey, but this hypothesis remains unresolved.Streamlining our simulationsDespite using advanced modeling software that has undergone years of development and optimization, each of the simulations in this lesson took several hours to render because they require us to track the movement of tens of thousands of particles over thousands of generations.We wonder if it is possible to build a model of Turing patterns that does not require so much computational overhead. In other words, is there a simplifying speed-up that we can make to our reaction-diffusion model that will still produce Turing patterns? We will turn our attention to this question in the next lesson.Next lesson            H. Klüver. Mescal and Mechanisms of Hallucinations. University of Chicago Press, 1966. &#8617;              G.B. Ermentrout and J.D. Cowan. “A Mathematical Theory of Visual Hallucination Patterns”. Biol. Cybernetics 34, 137-150 (1979). &#8617;              J. Ouellette. “A Math Theory for Why People Hallucinate”. Quanta Magazine, July 30, 2018. https://www.quantamagazine.org/a-math-theory-for-why-people-hallucinate-20180730/ &#8617;      "
     
   } ,
  
   {
     
   } ,
  
   {
     
        "title"    : "Segmenting White Blood Cell Images",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/segmentation",
        "date"     : "",
        "content"  : "Image segmentation requires a tailored approachWe begin our work by discussing how to segment WBC nuclei from blood cell images like the one below, reproduced from the introduction.The granulocyte presented in the introduction (having ID 3 in our dataset).Researchers have developed many algorithms for cellular image segmentation, but no single approach can be used in all contexts. We therefore will identify the key attributes that make this dataset special, which we will use to develop our own segmentation algorithm.What makes the WBC nucleus so easy for a human to spot in the above blood cell? You may be screaming, “It is dark blue! How hard could it be?” But to train a computer to segment images by color, we should first understand how the computer encodes color in images.The RGB color modelIn the RGB color model, every rectangular pixel on a computer screen emits a single color formed as a mixture of differing amounts of the three primary colors of light: red, green, and blue (hence the acronym “RGB”). The intensity of each primary color in a pixel is expressed as an integer between 0 and 255, inclusively, with larger integers corresponding to greater intensities.A few colors are shown in the figure below along with their RGB equivalents; for example, magenta corresponds to equal parts red and blue. Note that a color like (128, 0, 0) contains only red but appears duskier than (256, 0, 0) because the red in that pixel is less intense.A collection of colors along with their RGB codes. This table corresponds to mixing colors of light instead of pigment, which causes some non-intuitive effects; for example, yellow is formed by mixing equal parts red and green. The last six colors appear muted because they only receive half of a given color value compared to a color that receives 256 units. If all three colors are mixed in equal proportions, then we obtain a color on the gray scale between white (255, 255, 255) and black (0, 0, 0). Source: Excel at Finance.The RGB model gives us an idea for segmenting a WBC nucleus. If we scan through the pixels in a blood cell image, we can ignore any pixels whose RGB color values are not sufficiently blue; hopefully, the remaining pixels are found in the WBC nucleus.STOP: You can find a color picker in Utilities &gt; Digital Color Meter (Mac OS X) or by using ShareX (Windows). Open your color picker, and hover the picker over different parts of the granulocyte image above. What are the typical RGB values for the WBC nucleus, and how do these RGB values differ from those of the RBCs and the image background?Binarizing an image based on a color thresholdWe will binarize each blood cell image by coloring a pixel white if its blue value is above some threshold, and coloring a pixel black if its blue value is beneath some threshold.The binarized version of the above granulocyte image using the threshold value of 153 is shown in the figure below. Unfortunately, we cannot clearly see the WBC nucleus in this binarized image because although the nucleus’s pixels have high blue values, so do those of the image’s background, which have high intensities of red, green, and blue, producing the background’s light appearance.A binarized version of the granulocyte image from the previous figure (having image ID 3 in our dataset). A pixel is colored white if it has a blue value of 153 or greater, and a pixel is colored black otherwise. 
The region with the nucleus is not clearly visible because much of the original image’s background is light, and so its pixels have large red, green, and blue values.STOP: How might we modify our segmentation approach to perform a binarization that identifies the WBC nucleus more effectively?We were unable to distinguish between the image background and the WBC nucleus using blue color values, but pixels in the WBC nucleus tend to have a green value that is much lower than the image background and a red value that is lower than every other part of the image. The figure below shows two binarizations of the original image using a green threshold of 153 (left) and a red threshold of 166 (right). Two more binarized versions of the granulocyte image from the figure above, based on the green and red channels. (Left) A binarization in which a pixel is colored white if it has a green value less than or equal to 153, and a pixel is colored black otherwise. (Right) A binarization in which a pixel is colored white if it has a red value less than or equal to 166, and a pixel is colored black otherwise.  It might seem that we should work with the binarized image based on the red threshold, which contains the clearest image of the nucleus among the three binarized images. However, note that each threshold was successful in eliminating some of the non-nuclear parts of the image. For example, the white regions in the top left of both binarized images in the figure above were eliminated by the binarization based on the blue threshold, which initially did not seem helpful.This insight gives us an idea: if each of the three binarized images based on thresholding a single color channel was successful at excluding some part of the image, then let us produce a fourth image in which a pixel is colored white only if it is white in all three binarized images, and a pixel is colored black if it is black in any of the three binarized images. In the following tutorial, we will build an R pipeline that implements this approach to produce binarized WBC nuclei for all our blood cell images.Visit tutorialSuccessful segmentation is subject to parametersIf you followed the above tutorial, then you may be tempted to celebrate, since it seems that we have resolved our first main objective of identifying WBCs. Indeed, if we segment all of the images in the dataset via the same process, then we typically obtain a nice result, as indicated in the figure below for the sample monocyte and lymphocyte images presented in the introduction. Image segmentation of the monocyte (left) and lymphocyte (right) corresponding to IDs 15 and 20 in the provided dataset.  At the same time, no segmentation pipeline is perfect. The figure below illustrates that for a few images in our dataset, we may not correctly segment the entire nucleus. (Left) An image of a WBC (ID: 167). (Right) The binarization of this image, showing that the nucleus is not correctly identified during segmentation using the parameters described in this lesson.  We can continue to tweak threshold parameters, but our relatively simple algorithm successfully segments almost every WBC nucleus from our dataset, and we are ready to move on to classify WBC nuclei into families.Next lesson"
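The tutorial implements this combined thresholding in R; the following is an equivalent sketch in Python with NumPy and Pillow. The file names are hypothetical, and the thresholds are the ones discussed in this lesson.

```python
import numpy as np
from PIL import Image

# Load a blood cell image as an RGB array (the file name is hypothetical).
rgb = np.asarray(Image.open("granulocyte_3.png").convert("RGB")).astype(int)
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# A pixel is white only if it passes all three single-channel thresholds
# described above; otherwise it is black.
nucleus = (blue >= 153) & (green <= 153) & (red <= 166)

Image.fromarray((nucleus * 255).astype(np.uint8)).save("binarized_3.png")
```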
     
   } ,
  
   {
     
        "title"    : "Shape Spaces",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/shape_space",
        "date"     : "",
        "content"  : "Stone tablets and lost citiesImagine that you are a traveler to Earth and come across the ruins of New York City. You find an old road atlas that has a table of driving distances between cities (in miles), shown in the table below. Can you use this atlas to find the other cities in the table? In an earlier module, we encountered a “Lost Immortals” problem; this problem, of inferring the locations of cities given the distance between them, we call “Lost Cities”.                   New York      Los Angeles      Pittsburgh      Miami      Houston      Seattle                  New York      0      2805      371      1283      1628      2852              Los Angeles      2805      0      2427      2733      1547      1135              Pittsburgh      371      2427      0      1181      1388      2502              Miami      1283      2733      1181      0      1189      3300              Houston      1628      1547      1388      1189      0      2340              Seattle      2852      1135      2502      3300      2340      0      STOP: If you know the locations of New York and Seattle, how could you use the information in the table above to find the other cities?This seemingly contrived example has a real archaeological counterpart. In 2019, researchers used commercial records that had been engraved by Assyrian merchants onto 19th Century BCE stone tablets in order to estimate distances between pairs of lost Bronze age cities in present-day Turkey. Using this “atlas” of sorts, they estimated the locations of the lost cities.1You may be confused as to why biologists should care about stone tablets and lost cities. For now, we will return to our problem of classifying segmented WBC images by family.Vectorizing a segmented imageAs we mentioned in the previous lesson, we would like to apply k-NN to our example of segmented WBC images. Yet k-NN first requires each object to be represented by a feature vector, and so we need some way of converting a WBC image into a feature vector. In this way, we can produce a shape space, or an assignment of (cellular image) shapes to points in multi-dimensional space.You may notice that the problem of “vectorizing” a WBC image is similar to one that we have already encountered in our module on protein structures. In that module, we vectorized a protein structure S as the collection of locations of its n alpha carbons to produce a vector  s = (s1, …, sn), where si is the position of the i-th alpha carbon of S.We will apply the same idea to vectorize our segmented WBCs. Given a binarized WBC nucleus image, we will first center the image so that its center of mass is at the origin, and then sample n points from the boundary of the cell nucleus to produce a shape vector s = (s1, …, sn), where si is a point with coordinates (x(i), y(i)).Note: Both isolating the boundary of a binarized image and sampling points from this boundary to ensure that points are similarly spaced are challenging tasks that are outside the scope of our work here, and which we will let CellOrganizer handle for us.To determine the “distance” between two images’ shape vectors, we will use our old friend root mean square deviation (RMSD), which is very similar to the Euclidean distance. 
Recall that the RMSD between shape vectors s and t is\[\text{RMSD}(\mathbf{s}, \mathbf{t}) = \sqrt{\dfrac{1}{n} \cdot [d(s_1, t_1)^2 + d(s_2, t_2)^2 + \cdots + d(s_n, t_n)^2]}\,.\]Inferring a shape space from pairwise distancesIt is tempting to take the vectorization of every shape as our desired shape space. If this were the case, then we would hope that images of similarly shaped nuclei would have low RMSD and that the more dissimilar two nuclei become, the higher the RMSD of their shape vectors. The potential issues with this assumption are the same as those encountered when discussing protein structures, which we now review.On the one hand, we need to ensure that the number of points that we sample from the object boundary is sufficiently high to prevent dissimilar shapes from having low RMSD. For this reason, CellOrganizer samples n = 1000 points by default for cell nuclei.On the other hand, we could have very similar shapes whose RMSD winds up being high. For example, recall the shapes in the figure below, which are identical, but one has been flipped and rotated. If we were to vectorize these shapes as they are now in the same way (say, by starting at the top of the shape and proceeding clockwise), then we would obtain two vectors with high RMSD.Two identical shapes, with one shape flipped and rotated. Vectorizing these shapes without first correctly aligning them will produce two vectors with high RMSD.We handled the latter issue in our work on protein structure comparison by introducing the Kabsch algorithm, which identified the best rotation of one shape into another that would minimize the RMSD of the resulting vectors. Yet what makes our work here more complicated is that we are not comparing two WBC image shape vectors; we are comparing hundreds.We could apply the Kabsch algorithm to every pair of images, producing the RMSD between every pair of images. We would then need to build a shape space from all these distances between pairs of shapes. We hope that this problem sounds familiar, as it is the Lost Cities problem in disguise. The pairwise distances between images correspond to a road atlas, and placing images into a shape space corresponds to locating cities.Note: CellOrganizer includes one model that applies an alternative approach to the Kabsch algorithm for computing a cellular distance, called the diffeomorphic distance2, which can be thought of intuitively as determining the amount of energy required to deform one shape into another.Statisticians have devised a collection of approaches called multi-dimensional scaling to solve versions of the Lost Cities problem that arise frequently in practice. The fundamental idea of multi-dimensional scaling is to assign points to n-dimensional space such that the distances between points in this space approximately resemble a collection of distances between pairs of objects in some dataset.STOP: If we have m cellular images, then how many times will we need to compute the distance between a pair of images?Aligning many images concurrentlyUnfortunately, for a large image dataset, computing the distance between every pair of images can prove time-intensive, even with a powerful computer. Instead, we will rotate all images concurrently. 
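Before describing this alignment, we note that the RMSD formula above translates directly into code. Here is a minimal sketch in Python with NumPy (the function and variable names are ours, not CellOrganizer's).

```python
import numpy as np

def rmsd(s, t):
    # s and t: (n, 2) arrays of boundary points sampled from two shapes
    # that have already been centered and consistently ordered.
    squared = np.sum((s - t) ** 2, axis=1)   # d(s_i, t_i)^2 for each i
    return np.sqrt(squared.mean())
```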
After this alignment, we can then center and vectorize all the images starting at the same position.One way of aligning a collection of images is to first identify the major axis of each image, which is the longest line segment that connects two points on the outside of the image and crosses through the image’s center of mass. The figure below shows the major axis for a few similar shapes.Three similar shapes, with their major axes highlighted in gray.Aligning the major axes of these similar shapes reveals their similarities (see figure below). These images are ready to be vectorized (say, starting from the point on the right side of an image’s major axis and proceeding counterclockwise). The resulting vectors will have low RMSD because corresponding points on the shapes will be nearby.Aligning the three images from the previous figure so that their major axes overlap allows us to see similarities between the shapes as well as build shape vectors for them having a consistent frame of reference.Note: In practice, when we align shapes along their major axes, we need to consider the flip of each shape across its major axis as well. Handling this issue is beyond the scope of our work here but is discussed in the literature.3By aligning and then vectorizing a collection of binarized cellular images, we obtain feature vectors that form our desired shape space. We are almost ready to apply a classifier to this shape space, but one more pitfall remains.Next lesson            Barjamovic B, Chaney T, Coşar K, Hortaçsu A (2019) Trade, Merchants, and the Lost Cities of the Bronze Age. The Quarterly Journal of Economics 134(3):1455-1503. Available online &#8617;              Rohde G, Ribeiro A, Dahl K, Murphy F (2008) Deformation-based nuclear morphometry: capturing nuclear shape variation in HeLa cells. Cytometry Part A 73:341–350. Available online &#8617;              Pincus Z, Theriot J (2007) Comparison of quantitative methods for cell-shape analysis. Journal of Microscopy 227(Pt 2):140-56. Available online &#8617;      "
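As a coda to the Lost Cities example, multi-dimensional scaling can recover approximate city positions from the driving-distance table earlier in this lesson. Below is a minimal sketch, assuming scikit-learn is installed; the recovered map is determined only up to rotation and reflection.

```python
import numpy as np
from sklearn.manifold import MDS

cities = ["New York", "Los Angeles", "Pittsburgh",
          "Miami", "Houston", "Seattle"]
distances = np.array([
    [   0, 2805,  371, 1283, 1628, 2852],
    [2805,    0, 2427, 2733, 1547, 1135],
    [ 371, 2427,    0, 1181, 1388, 2502],
    [1283, 2733, 1181,    0, 1189, 3300],
    [1628, 1547, 1388, 1189,    0, 2340],
    [2852, 1135, 2502, 3300, 2340,    0],
])

# Assign each city a point in the plane so that pairwise Euclidean
# distances approximately match the driving distances.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distances)
for city, (x, y) in zip(cities, coords):
    print(f"{city:12s} {x:8.0f} {y:8.0f}")
```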
     
   } ,
  
   {
     
        "title"    : "Signaling and Ligand-Receptor Dynamics",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/signal",
        "date"     : "",
        "content"  : "Cells detect and transduce signals via receptor proteinsChemotaxis is one of many ways in which a cell must perceive a change in its environment and react accordingly. This response is governed by a process called signal transduction, in which a cell identifies a stimulus outside the cell and then transmits this stimulus into the cell.When a certain molecule’s extracellular concentration increases, receptor proteins on the outside of the cell have more frequent binding with these molecules and are therefore able to detect changes in molecular concentration. This signal is then “transduced” via a series of internal chemical processes.For example, transcription factors, which we discussed in the previous module, are involved in a signal transduction process. When some extracellular molecule is detected, a cascade begins that eventually changes a transcription factor into an active state, so that it is ready to activate or repress the genes that it regulates.In the case of chemotaxis, E. coli has receptor proteins that detect attractants such as glucose by binding to and forming a complex with these attractant ligands. The bacterium also contains receptors to detect repellents, but we will focus on modeling the binding of a single type of receptor to a single type of attractant ligand.  In later lessons, we will enter the cell and model the cascade of reactions after this binding has occurred, as shown in the figure below, which cause a change in the rotation of one or more flagella.A high-level overview of the chemotaxis signaling pathway. The red circles labeled L represent attractant ligands. When these ligands bind to receptors, a signal is transduced inside the cell via a series of enzymes, which eventually influences the rotation direction of a flagellum.In this lesson, we will discuss how to model ligand-receptor binding.Ligand-receptor dynamics can be modeled by a reversible reactionThe chemical reactions that we have considered earlier in this course are irreversible, meaning they can only proceed in one direction. For example, in the prologue’s reaction-diffusion model, we modeled the reaction A + 2B  3B, but we did not consider the reverse reaction 3B  A + 2B.To model ligand-receptor dynamics, we will use a reversible reaction that proceeds continuously in both directions at possibly different rates. If a ligand collides with a receptor, then there is some probability that the two molecules will bind into a complex. At the same time, in any unit of time, there is also some probability that a bound receptor-ligand complex will dissociate into two separate molecules. The better suited a receptor is to a ligand, the higher the binding rate and the lower the dissociation rate. In a future module, we will discuss some of the biochemical details underlying what makes two molecules more or less likely to bind and disssociate.Note: You may be wondering why ligand-receptor binding is reversible. If complexes did not dissociate, then a brief increase in ligand concentration would be detected indefinitely by the surface receptors. Without releasing the bound ligands, the cell would need to manufacture more receptors, which are complicated molecules.We denote the ligand molecule by L, the receptor molecule by T, and the bound complex by LT. The reversible reaction representing complex binding and dissociation is L + T ←→ LT and consists of two reactions. 
The forward reaction is L + T → LT, which occurs at a rate depending on some rate constant kbind, and the reverse reaction is LT → L + T, which occurs at a rate depending on some rate constant kdissociate.If we start with a free-floating supply of L and T molecules, then LT complexes will initially be formed quickly at the expense of the free-floating L and T molecules. The reverse reaction will not occur because of the lack of LT complexes. However, as the concentration of LT grows and the concentrations of L and T decrease, the rate of increase in the concentration of LT will slow. Eventually, the number of LT complexes being formed by the forward reaction will balance the number of LT complexes being split apart by the reverse reaction. At this point, the concentrations of all particles reach equilibrium.Calculation of equilibrium in a reversible ligand-receptor reactionFor a single reversible reaction, if we know the rates of both the forward and reverse reactions, then we can calculate the steady state concentrations of L, T, and LT by hand.  Suppose that we begin with initial concentrations of L and T that are represented by l0 and t0, respectively. Let [L], [T], and [LT] denote the concentrations of the three molecule types, and assume that the reaction rate constants kbind and kdissociate are fixed.When the steady state concentration of LT is reached, the rates of the forward and reverse reactions are equal. In other words, the number of complexes being produced is equal to the number of complexes dissociating:kbind · [L] · [T] = kdissociate · [LT].We also know that, by the law of conservation of mass, the total amount of ligand (free plus bound) and the total amount of receptor (free plus bound) never change and equal their initial amounts. That is, at any time point,[L] + [LT] = l0[T] + [LT] = t0.We solve these two equations for [L] and [T] to yield[L] = l0 - [LT][T] = t0 - [LT].We substitute the expressions on the right for [L] and [T] into our original steady state equation:kbind · (l0 - [LT]) · (t0 - [LT]) = kdissociate · [LT]We then expand the left side of the above equation:kbind · [LT]² - (kbind · l0 + kbind · t0) · [LT] + kbind · l0 · t0 = kdissociate · [LT]Finally, we subtract the right side of this equation from both sides:kbind · [LT]² - (kbind · l0 + kbind · t0 + kdissociate) · [LT] + kbind · l0 · t0 = 0This equation may look daunting, but most of its components are constants. In fact, the only unknown is [LT], which makes this a quadratic equation, with [LT] as the variable.In general, a quadratic equation has the form a · x² + b · x + c = 0 for a single variable x and constants a, b, and c. In our case, x = [LT], a = kbind, b = - (kbind · l0 + kbind · t0 + kdissociate), and c = kbind · l0 · t0. The quadratic formula (which you may have thought you would never use again) tells us that the quadratic equation has solutions for x given by\[x = \dfrac{-b \pm \sqrt{b^2 - 4 \cdot a \cdot c}}{2 \cdot a}\,.\]STOP: Use the quadratic formula to solve for [LT] in our previous equation and find the steady state concentration of LT. How can we use this solution to find the steady state concentrations of L and T as well?Now that we have reduced the computation of the steady state concentration of LT to the solution of a quadratic equation, we will compute this steady state concentration for a sample collection of parameters. 
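This computation is easy to script. Below is a minimal sketch in Python (the function name is ours); it takes the smaller root of the quadratic, since [LT] can exceed neither l0 nor t0, and it reproduces the worked example that follows.

```python
import math

def steady_state_LT(k_bind, k_dissociate, l0, t0):
    # Coefficients of the quadratic equation derived above:
    # k_bind*[LT]^2 - (k_bind*l0 + k_bind*t0 + k_dissociate)*[LT]
    #   + k_bind*l0*t0 = 0.
    a = k_bind
    b = -(k_bind * l0 + k_bind * t0 + k_dissociate)
    c = k_bind * l0 * t0
    # Take the smaller root: [LT] can exceed neither l0 nor t0.
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

print(steady_state_LT(2, 5, 50, 50))   # 40.0, matching the example below
```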
Say that we are given the following parameter values (the units of these parameters are not important for this toy example):  kbind = 2;  kdissociate = 5;  l0 = 50;  t0 = 50.Substituting these values into the quadratic equation, we obtain the following:  a = kbind = 2  b = - (kbind · l0 + kbind · t0 + kdissociate) = -205  c = kbind · l0 · t0 = 5000That is, we are solving the equation 2 · [LT]² - 205 · [LT] + 5000 = 0. Using the quadratic formula to solve for [LT] gives\([LT] = \dfrac{205 \pm \sqrt{205^2 - 4 \cdot 2 \cdot 5000}}{2 \cdot 2} = 51.25 \pm 11.25\).It would seem that this equation has two solutions: [LT] = 51.25 + 11.25 = 62.5 and [LT] = 51.25 - 11.25 = 40. Yet because l0 and t0, the respective initial concentrations of L and T, are both equal to 50, the first “solution” would imply that [L] = l0 - [LT] = 50 - 62.5 = -12.5 and [T] = t0 - [LT] = 50 - 62.5 = -12.5, which is impossible because the concentration of a particle cannot be negative.Now that we know the steady state concentration of LT must be 40, we can recover the values of [L] and [T] as[L] = l0 - [LT] = 10[T] = t0 - [LT] = 10.What if the forward reaction were slower (i.e., kbind were lower)? We would imagine that the equilibrium concentration of LT should decrease. For example, if we halve kbind, then we obtain the following adjusted parameter values:      a = kbind = 1        b = - (kbind · l0 + kbind · t0 + kdissociate) = -105        c = kbind · l0 · t0 = 2500  In this case, if we solve the quadratic equation for [LT], then we obtain\([LT] = \dfrac{105 \pm \sqrt{105^2 - 4 \cdot 1 \cdot 2500}}{2 \cdot 1} = 52.5 \pm 16.008\).The only feasible solution is 52.5 - 16.008 = 36.492. As anticipated, the steady state concentration has decreased.STOP: What do you think will happen to the steady state concentration of LT if the initial concentration of L (l0) increases or decreases? What if the dissociation rate (kdissociate) increases or decreases?  Confirm your predictions by changing these parameters and applying the quadratic formula to find the concentration of [LT].Where are the units?We have conspicuously not provided any units in the calculations above for the sake of simplicity, and so we will pause to explain what these units are. The concentration of a particle (whether it is L, T, or LT) is measured in molecules/µm³, the number of molecules per unit volume. But what about the binding and dissociation rates?When we multiply the binding rate constant kbind by the concentrations [L] and [T], the resulting unit should be (molecules/µm³) per second, which corresponds to the rate at which the concentration [LT] of complexes is increasing. If we let y denote the unknown units of kbind, theny · (molecules/µm³) · (molecules/µm³) = (molecules/µm³)s⁻¹and solving for y givesy = ((molecules/µm³)⁻¹)s⁻¹.STOP: Use a similar argument to show that the units of the dissociation rate kdissociate should be s⁻¹.Steady state ligand-receptor concentrations for an experimentally verified exampleHaving established the units in our model, we will solve our quadratic equation once more to identify steady state concentrations using experimentally verified binding and dissociation rates. The experimentally verified rate constant for the binding of receptors to glucose ligands is kbind = 0.0146 ((molecules/µm³)⁻¹)s⁻¹, and the dissociation rate constant is kdissociate = 35 s⁻¹.123 We will model an E. coli cell with 7,000 receptor molecules in an environment containing 10,000 ligand molecules. 
Using these values, we obtain the following constants a, b, and c in the quadratic equation:  a = kbind = 0.0146  b = - (kbind · l0 + kbind · t0 + kdissociate) = -283.2  c = kbind · l0 · t0 = 1,022,000When we solve for [LT] using the quadratic formula, we obtain [LT] = 4,793 molecules/µm³. Now that we have this value along with l0 and t0, we can solve for [L] and [T] as well:[L] = l0 - [LT] = 5,207 molecules/µm³[T] = t0 - [LT] = 2,207 molecules/µm³We can therefore determine the steady state concentration for a single reversible reaction. However, if we want to model real cellular processes, we will have many reactions for a variety of different particles. It will quickly become infeasible to solve all the resulting equations exactly. Instead, we need a method of simulating many reactions in parallel without incurring the significant computational overhead required to track the movements of every particle.Next lesson            Li M, Hazelbauer GL. 2004. Cellular stoichiometry of the components of the chemotaxis signaling complex. Journal of Bacteriology. Available online &#8617;              Spiro PA, Parkinson JS, and Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Biochemistry 94:7263-7268. Available online. &#8617;              Stock J, Lukat GS. 1991. Intracellular signal transduction networks. Annual Review of Biophysics and Biophysical Chemistry. Available online &#8617;      "
     
   } ,
  
   {
     
   } ,
  
   {
     
        "title"    : "Solutions",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/solutions",
        "date"     : "",
        "content"  : "How do E. coli respond to repellents?Exercise 1In contrast to that CheY phosphorylations decrease and tumbling becomes less frequent when the cell senses higher attractant concentrations, when the cell senses more repellents there should be more frequent tumbling. The decreased tumbling frequency should be a result of increased CheY phosphorylations. The cell should always be able to adapt to the current concentrations, therefore we also expect the CheY phosphoryaltions be restored when adpating.Exercise 2Update reaction rule for ligand-receptor binding fromBoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*0.2toBoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*5The complete code (you can download a completed BioNetGen file here: exercise_repel.bngl):begin modelbegin molecule types	L(t)             #ligand molecule	T(l,Phos~U~P)    #receptor complex	CheY(Phos~U~P)	CheZ()end molecule typesbegin parameters	NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	L0 5e3          #number of ligand molecules	T0 7000       #number of receptor complexes	CheY0 20000	CheZ0 6000	k_lr_bind 8.8e6/NaV2   #ligand-receptor binding	k_lr_dis 35            #ligand-receptor dissociation	k_T_phos 15            #receptor complex autophosphorylation	k_Y_phos 3.8e6/NaV2    #receptor complex phosphorylates Y	k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Yend parametersbegin reaction rules	LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis	#Free vs. ligand-bound receptor complexes autophosphorylates at different rates	FreeTP: T(l,Phos~U) -&gt; T(l,Phos~P) k_T_phos	BoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*5	YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDeps: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephosend reaction rulesbegin species	L(t) L0	T(l,Phos~U) T0*0.8	T(l,Phos~P) T0*0.2	CheY(Phos~U) CheY0*0.5	CheY(Phos~P) CheY0*0.5	CheZ() CheZ0end speciesbegin observables	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules phosphorylated_CheA T(Phos~P)	Molecules bound_ligand L(t!1).T(l!1)end observablesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;3, n_steps=&gt;100})The simulation outputs:What if there are multiple attractant sources?Exercise 1:In molecule types and observables, update L(t) and T(l,r,Meth~A~B~C,Phos~U~P) to L(t,Lig~A~B) and T(l,r,Lig~A~B,Meth~A~B~C,Phos~U~P), where A and B represent the two ligand types. Update the reaction ruleLR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_distoL1R: L(t,Lig~A) + T(l,Lig~A) &lt;-&gt; L(t!1,Lig~A).T(l!1,Lig~A) k_lr_bind, k_lr_disL2R: L(t,Lig~B) + T(l,Lig~B) &lt;-&gt; L(t!1,Lig~B).T(l!1,Lig~B) k_lr_bind, k_lr_disAlso update the species by equally split the initial receptor concentrations by 2.You can download a completed BioNetGen file here: exercise_twoligand.bngl.Exercise 2:To wait for adaptation to ligand A, we could replace the forward reaction rate with this rule: rate constant = 0 unless after adapting to A. We could run the simulation without B first and observe the equilibrium methylation states, and use this for deciding whether the cell is adapted to A. (Why not equilibrium concentrations of free A?) 
One possible implementation is the following: replaceL1R: L(t,Lig~A) + T(l,Lig~A) &lt;-&gt; L(t!1,Lig~A).T(l!1,Lig~A) k_lr_bind, k_lr_disL2R: L(t,Lig~B) + T(l,Lig~B) &lt;-&gt; L(t!1,Lig~B).T(l!1,Lig~B) k_lr_bind, k_lr_diswithL1R: L(t,Lig~A) + T(l,Lig~A) &lt;-&gt; L(t!1,Lig~A).T(l!1,Lig~A) k_lr_bind, k_lr_disL2R: L(t,Lig~B) + T(l,Lig~B) &lt;-&gt; L(t!1,Lig~B).T(l!1,Lig~B) l2rate(), k_lr_disand l2rate() is a function defined as (remember to define it before reaction rules)begin functions	l2rate() = if(high_methyl_receptor&gt;1.2e3,k_lr_bind,0)end functionsThe complete code:begin modelbegin compartments  EC  3  100       #um^3  PM  2  1   EC    #um^2  CP  3  1   PM    #um^3end compartmentsbegin molecule types	L(t,Lig~A~B)	T(l,r,Lig~A~B,Meth~A~B~C,Phos~U~P)	CheY(Phos~U~P)	CheZ()	CheB(Phos~U~P)	CheR(t)end molecule typesbegin observables	Molecules bound_ligand L(t!1).T(l!1)	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules low_methyl_receptor T(Meth~A)	Molecules medium_methyl_receptor T(Meth~B)	Molecules high_methyl_receptor T(Meth~C)	Molecules phosphorylated_CheB CheB(Phos~P)end observablesbegin parameters	NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	miu 1e-6	L0 1e6	T0 7000	CheY0 20000	CheZ0 6000	CheR0 120	CheB0 250	k_lr_bind 8.8e6/NaV2   #ligand-receptor binding	k_lr_dis 35            #ligand-receptor dissociation	k_TaUnbound_phos 7.5   #receptor complex autophosphorylation	k_Y_phos 3.8e6/NaV2    #receptor complex phosphorylates Y	k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Y	k_TR_bind 2e7/NaV2          #Receptor-CheR binding	k_TR_dis  1            #Receptor-CheR dissociaton	k_TaR_meth 0.08        #CheR methylates receptor complex	k_B_phos 1e5/NaV2      #CheB phosphorylation by receptor complex	k_B_dephos 0.17        #CheB autodephosphorylation	k_Tb_demeth 5e4/NaV2   #CheB demethylates receptor complex	k_Tc_demeth 2e4/NaV2   #CheB demethylates receptor complexend parametersbegin functions	l2rate() = if(high_methyl_receptor&gt;1.2e3,k_lr_bind,0)end functionsbegin reaction rules	L1R: L(t,Lig~A) + T(l,Lig~A) &lt;-&gt; L(t!1,Lig~A).T(l!1,Lig~A) k_lr_bind, k_lr_dis	L2R: L(t,Lig~B) + T(l,Lig~B) &lt;-&gt; L(t!1,Lig~B).T(l!1,Lig~B) l2rate(), k_lr_dis	#L3R: L(t,Lig~T) + T(l,Lig~O) &lt;-&gt; L(t!1,Lig~O).T(l!1,Lig~O) l2rate(), k_lr_dis	#Receptor complex (specifically CheA) autophosphorylation	#Rate dependent on methylation and binding states	#Also on free vs. 
bound with ligand	TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos	TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1	TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8	TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0	TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8	TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6	#CheY phosphorylation by T and dephosphorylation by CheZ	YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephos	#CheR binds to and methylates receptor complex	#Rate dependent on methylation states and ligand binding	TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis	TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth	TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1	TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30	TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3	#CheB is phosphorylated by receptor complex, and autodephosphorylates	CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos	CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos	#CheB demethylates receptor complex	#Rate dependent on methyaltion states	TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth	TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demethend reaction rulesbegin species	@EC:L(t,Lig~A) L0	@EC:L(t,Lig~B) L0	@PM:T(l,r,Lig~A,Meth~A,Phos~U) T0*0.84*0.9*0.5	@PM:T(l,r,Lig~A,Meth~B,Phos~U) T0*0.15*0.9*0.5	@PM:T(l,r,Lig~A,Meth~C,Phos~U) T0*0.01*0.9*0.5	@PM:T(l,r,Lig~A,Meth~A,Phos~P) T0*0.84*0.1*0.5	@PM:T(l,r,Lig~A,Meth~B,Phos~P) T0*0.15*0.1*0.5	@PM:T(l,r,Lig~A,Meth~C,Phos~P) T0*0.01*0.1*0.5	@PM:T(l,r,Lig~B,Meth~A,Phos~U) T0*0.84*0.9*0.5	@PM:T(l,r,Lig~B,Meth~B,Phos~U) T0*0.15*0.9*0.5	@PM:T(l,r,Lig~B,Meth~C,Phos~U) T0*0.01*0.9*0.5	@PM:T(l,r,Lig~B,Meth~A,Phos~P) T0*0.84*0.1*0.5	@PM:T(l,r,Lig~B,Meth~B,Phos~P) T0*0.15*0.1*0.5	@PM:T(l,r,Lig~B,Meth~C,Phos~P) T0*0.01*0.1*0.5	@CP:CheY(Phos~U) CheY0*0.71	@CP:CheY(Phos~P) CheY0*0.29	@CP:CheZ() CheZ0	@CP:CheB(Phos~U) CheB0*0.62	@CP:CheB(Phos~P) CheB0*0.38	@CP:CheR(t) CheR0end speciesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;700, n_steps=&gt;400})The simulation outputs:Exercise 3:Define ligand_center1 = [1500, 1500] and ligand_center2 = [-1500, 1500]. Since we are considering two gradients, we can add up the ligand concentration. We can replace our cal_concentraion(pos) function withdef calc_concentration(pos):    dist1 = euclidean_distance(pos, ligand_center1)    dist2 = euclidean_distance(pos, ligand_center2)    exponent1 = (1 - dist1 / origin_to_center) * (center_exponent - start_exponent) + start_exponent    exponent2 = (1 - dist2 / origin_to_center) * (center_exponent - start_exponent) + start_exponent    return 10 ** exponent1 + 10 ** exponent2Is the actual tumbling reorientation used by E. coli smarter than our model?Now, for sampling the new direction, we need to consider the past concentration and the current concentration the bacterium experiences. 
Since the new direction is also dependent on the last direction, we also need to record the current directions.Therefore, for our tumble_move() function, we would consider three inputs: curr_direction, curr_conc, past_conc. If the current concentration is higher than the past concentration, we sample the turning with mean of 1.19π-0.1π=1.09π and standard deviation of 0.63π; otherwise ample the turning with mean of 1.19π and standard deviation of 0.63π. The new direction is the sum of the turning and the past direction.Add the mean and standard deviation of turning as constants.#Constants for E.coli tumblingtumble_angle_mu = 1.19tumble_angle_std = 0.63We implement the tumble_move function as the following:def tumble_move(curr_dir, curr_conc, past_conc):    #Sample the new direction    corrent = curr_conc &gt; past_conc    if correct:        new_dir = np.random.normal(loc = tumble_angle_mu - 0.1, scale = tumble_angle_std)    else:        new_dir = np.random.normal(loc = tumble_angle_mu, scale = tumble_angle_std)    new_dir *= np.random.choice([-1, 1])    new_dir += curr_dir    new_dir = new_dir % (2 * math.pi) #keep within [0, 2pi]    projection_h = math.cos(new_dir) #Horizontal displacement for next run    projection_v = math.sin(new_dir) #Vertical displacement for next run    tumble_time = np.random.exponential(tumble_time_mu) #Length of the tumbling    return new_dir, projection_h, projection_v, tumble_timeUpdate the simulate function by replacingprojection_h, projection_v, tumble_time = tumble_move()withcurr_direction, projection_h, projection_v, tumble_time = tumble_move(curr_direction, curr_conc, past_conc)Can’t get enough BioNetGen?Exercise 1:You should know the molecules involved (molecule types), reactions and reaction rate constants (reaction rules), the initial conditions (species), the quantities you are interested in observing (observables), your simulation methods and time steps. Compartments and parameters should also be considered if applicable.Exercise 2:The complete code (you can download a completed BioNetGen file here: exercise_polymerization.bngl):begin modelbegin molecule types	A(h,t)end molecule typesbegin reaction rules	Initiation: A(h,t) + A(h,t) &lt;-&gt; A(h,t!1).A(h!1,t) 0.01,0.01	Polymerizationfree: A(h!+,t) + A(h,t) &lt;-&gt; A(h!+,t!1).A(h!1,t) 0.01,0.01	Polymerizationfree2: A(h,t) + A(h,t!+) &lt;-&gt; A(h,t!1).A(h!1,t!+) 0.01,0.01	Polymerizationbound: A(h!+,t) + A(h,t!+) &lt;-&gt; A(h!+,t!1).A(h!1,t!+) 0.01,0.01end reaction rulesbegin species	A(h,t) 1000end speciesbegin observables	Species A1 A==1	Species A2 A==2	Species A3 A==3	Species A5 A==5	Species A10 A==10	Species A20 A==20	Species ALong A&gt;=30end observablesend modelsimulate({method=&gt;"nf", t_end=&gt;50, n_steps=&gt;1000})The simulation outputs (note the concentrations are in log-scale):How to calculate steady state concentration in a reversible bimolecular reaction?Exercise 1:When the reaction begins, concentrations change toward the equilibrium concentrations. The system remains at the equilibrium state once reaching it.Exercise 2:Use [A], [B], [AB] to denote the equilibrium concentrations. At equilibrium concentrations, we havekbind · [A] · [B] = kdissociate · [AB].Because of conservation of mass, if the instead starts from no AB, our initial conditions will be a0 = b0 = 100, and ab0 = 0. 
(If we instead work from the “current” concentrations, a0 = b0 = 95, and ab0 = 5, how would you set up the calculations?)Similar as in the main text, Our original steady state equation can be modified tokbind · (a0 - [AB]) · (b0 - [AB]) = kdissociate · [AB].Solving this equation yields [AB] = 90.488.Exercise 3:If we add additional 100 A molecules to the system, more AB will be formed. If you use the equation setup in the solution above, we can simply update a0 = 200. [AB] = 99.019.If kdissociate = 9 instead of 3, less AB will be present at the equilibirum state. [AB] = 84.115.How to simulate a reaction step with the Gillespie algorithm?Exercise 1:Shorter because molecules collide to each other and react more frequently.Exercise 2:In this system, we have λ = 100. The probability that exactly 100 reaction happen in the next second is\[\mathrm{Pr}(X = 100) = \dfrac{\lambda^n e^{-\lambda}}{n!} = 0.03986\,.\]The expected wait time is 1/λ = 0.01.The probability that the first reaction occur after 0.02 second is\[\mathrm{Pr}(T &gt; 0.02) = e^{-\lambda t} = 0.1353\,.\]Exercise 3:At the beginning of the simulation, only one type of reaction could occur: L + T → LT. The rate of reaction is kbind[L][T] = 100molecule·s-1. Therefore we have λ = 100molecule·s-1, and the expected wait time is thus 1/λ = 0.01s·molecule-1.Although the expected wait time before the first reaction is considerably shorter than 0.1s, it is still possible for the first reaction to happen after 0.1s.After the first reaction, our system has 9 L, 9 T, and 1 LT molecules. There are two possible types of reactions to occur: the forward reaction L + T → LT and the reverse reaction LT → L + T. The rate of forward reaction is kbind[L][T] = 81molecule·s-1, while the rate of reverse reaction is kdissociate[LT] = 2molecule·s-1. The total reaction rate is 83molecule·s-1 and hence the expected wait time before the next reaction is 0.012s. The probability of forward reaction is 81molecule·s-1/83molecule·s-1 = 0.976, and the probability of reverse reaction is 0.0241."
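The Poisson and exponential calculations in these solutions are easy to verify numerically. Here is a minimal Python sketch (the variable names are ours; kbind = 1 and kdissociate = 2 are the values implied by the rates quoted above):

```python
import math

lam = 100.0  # reaction rate lambda (expected reactions per second)

# Probability of exactly n = 100 reactions in one second (Poisson pmf).
n = 100
p_100 = lam**n * math.exp(-lam) / math.factorial(n)
print(round(p_100, 5))  # ~0.03986

# Expected wait time until the first reaction (exponential distribution).
print(1 / lam)  # 0.01 seconds

# Probability that the first reaction occurs after t = 0.02 seconds.
t = 0.02
print(round(math.exp(-lam * t), 4))  # 0.1353

# After one L + T -> LT event, with 9 L, 9 T, and 1 LT remaining:
rate_forward = 1 * 9 * 9  # k_bind * [L] * [T] = 81
rate_reverse = 2 * 1      # k_dissociate * [LT] = 2
total = rate_forward + rate_reverse
print(round(1 / total, 3))             # expected wait, ~0.012 s
print(round(rate_forward / total, 3))  # P(forward), ~0.976
```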
     
   } ,
  
   {
     
        "title"    : "Analysis of Structural Protein Differences",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/structural_differences",
        "date"     : "",
        "content"  : "Visualizing a region of structural differencesIn the previous lesson, we identified a region between residues 476 and 485 of the SARS-CoV-2 spike protein that corresponds to a structural difference between the SARS-CoV-2 and SARS-CoV RBMs. We now wish to determine whether the differences we have found affect binding affinity with the human ACE2 enzyme.We will first use VMD to highlight the amino acids in the region of interest of the SARS-CoV-2 spike protein’s structure.Visit tutorialAnalyzing three sites of conformational differencesShang et al.1 identified three sites showing significant conformational differences between the SARS-CoV-2 and SARS-CoV spike proteins. We will discuss each of these three locations and see how they affect binding affinity between the spike protein and ACE2.Site 1: loop in ACE2-binding ridgeThe first location is our region of interest from the previous lesson and is found on a loop in a region called the ACE2 binding ridge. This region is shown in the figure below, in which SARS-CoV-2 is on left and SARS-CoV is on the right.Structural differences are challenging to show with a 2-D image, but if you followed the preceding tutorial, then we encourage you to use VMD to view the 3-D representation of the protein. Instructions on how to rotate a molecule and zoom in and out within VMD are given in our tutorial on finding local protein differences.STOP: See if you can identify the major structural difference between the proteins in the figure below. Hint: look at the yellow residue.A visualization of the loop in the ACE2-binding ridge that is conformationally different between SARS-CoV-2 (left) and SARS-CoV (right). The coronavirus RBD is shown at the bottom in purple, and ACE2 is shown at the top in green. Structural differences cause certain amino acid residues, which are highlighted in various colors and described in the main text, to behave differently when ACE2 contacts each of the two viruses.In what follows, we use a three-letter identifier for an amino acid followed by a number to indicate the identity of that amino acid followed by its position within the protein sequence. For example, the phenylalanine at position 486 of the SARS-CoV-2 spike protein will be called Phe486.The most noticeable difference between SARS-CoV-2 and SARS-CoV in this region relates to a “hydrophobic pocket” of the three hydrophobic ACE2 residues Met82, Leu79, and Tyr83. This pocket, which is colored silver in the above figure, is hidden away from the outside of the ACE2 enzyme to keep these amino acids away from water. In SARS-CoV-2, Phe486 (colored yellow) inserts itself into the pocket, favorably interacting with ACE2. These interactions do not happen with SARS-CoV, and its corresponding RBD residue, Leu472, is not inserted into the pocket.1Although the interaction with the hydrophobic pocket is the most critical difference between SARS-CoV-2 and SARS-CoV in this region, we highlight two other key differences. First, in SARS-CoV-2, a main-chain hydrogen bond forms between Asn487 (colored blue) and Ala475 (colored red), which creates a more compact ridge conformation, pushing the loop containing Ala475 closer to ACE2. This repositioning allows for the N-terminal residue Ser19 in ACE2 (colored cyan in the above figure) to form a hydrogen bond with the main chain of Ala475. 
Second, Gln24 in ACE2 (colored orange in the above figure) forms a new contact with the RBM.Site 2: hotspot 31Hotspot 31 is not a failed Los Angeles nightclub but rather our second region of notable conformational differences between SARS-CoV-2 and SARS-CoV, which was studied in SARS-CoV as early as 200823. This location earns its name because it involves a salt bridge between Lys31 and Glu35 in ACE2, which is colored red in the figure below,.STOP: Once again, see if you can spot the differences between SARS-CoV-2 and SARS-CoV.Visualizations of hotspot 31 in SARS-CoV-2 (left) and SARS-CoV (right). The coronavirus RBD is shown at the bottom in purple, and ACE2 is shown at the top in green. In SARS-CoV, hotspot 31 corresponds to a salt bridge (red), which is broken in SARS-CoV-2 to form a new hydrogen bond with Gln493 (blue)The figure above shows how the salt bridge differs in the two viruses. In SARS-CoV, shown on the right, the two residues point towards each other because in the RBM, Tyr442 (colored yellow) supports the salt bridge between Lys31 and Glu35 on ACE2. In SARS-CoV-2, shown on the left, the corresponding amino acid is the less bulky Leu455 (yellow), which provides less support to Lys31. This causes the salt bridge to break, so that Lys31 and Glu35 in ACE2 point in parallel toward Gln493 (colored blue) on the RBD, forming hydrogen bonds with the spike protein.1.Site 3: hotspot 353Finally, we consider hotspot 353, which involves another salt bridge, this one connecting Lys353 and Asp38 in ACE2. In this region, the difference between the residues is so subtle that it takes a keen eye to notice them in the figure below.Visualizations of hotspot 353 in SARS-CoV-2 (left) and SARS-CoV (right). The RBD is shown in purple, and ACE2 is shown in green. In SARS-CoV, the RBD residue Thr487 (yellow) stabilizes the salt bridge between ACE2 residues Lys 353 and Asp38 (red). In SARS-CoV-2, the corresponding RBD residue Asn501 (yellow) provides less support, causing ACE2 residue Lys353 (red residue on the left) to be in a slightly different conformation and form a new hydrogen bond with the RBD.1In SARS-CoV, the methyl group of Thr487 (colored yellow in the right figure above) supports the salt bridge on ACE2, and the side-chain hydroxyl group of Thr487 forms a hydrogen bond with the RBM backbone. The corresponding SARS-CoV-2 amino acid Asn501 (colored yellow in left figure) also forms a hydrogen bond with the RBM main chain. However, similar to what happened in hotspot 31, Asn501 provides less support to the salt bridge, causing Lys353 on ACE2 (colored red) to be in a different conformation. This allows Lys353 to form an extra hydrogen bond with the main chain of the SARS-CoV-2 RBM while maintaining the salt bridge with Asp38 on ACE2.1You may be wondering how researchers can be so fastidious that they would notice all these subtle differences between the proteins, even if they know where to look. The secret is that they have quantitative methods to aid their qualitative descriptions of how protein structure affects binding.Computing the energy of a bound complexIn part 1 of this module, we searched for the tertiary structure that best “explains” a protein’s primary structure by looking for the structure with the lowest potential energy (i.e., the one that is the most chemically stable).To quantify whether two molecules bind well, we will borrow this idea and compute the potential energy of the complex formed by the viral RBD and ACE2. 
If two molecules bind well, then the complex will have a very low potential energy. In turn, if we compare the SARS-CoV RBD-ACE2 complex against the SARS-CoV-2 RBD-ACE2 complex, and we find that the potential energy of the latter is significantly smaller, then we can conclude that it is more stable, providing evidence for the increased infectiousness of SARS-CoV-2.In the following tutorial, we will compute the energy of the bound spike protein-ACE2 complex for the two viruses and see how the three regions that we identified in the previous lesson contribute to the total energy of the complex. To do so, we will employ NAMD, a program that was designed for high-performance large system simulations of biological molecules and is most commonly used with VMD via a plugin called NAMD Energy. This plugin will allow us to isolate a specific region to evaluate how much this local region contributes to the overall energy of the complex.Visit tutorialDifferences in interaction energy with ACE2 between SARS and SARS-CoV-2The table below shows the interaction energies for each of our three regions of interest as well as the total energy of the RBD-ACE2 complex for both SARS-CoV and SARS-CoV-2. The overall attractive interaction energy between the RBD and ACE2 is lower for SARS-CoV-2 than for SARS-CoV, which supports previous studies that have found the SARS-CoV-2 spike protein to have higher affinity with ACE2.ACE2 interaction energies of the chimeric SARS-CoV-2 RBD (left) and SARS-CoV RBD (right). The PDB files contain two biological assemblies, or instances, of the corresponding structure. The first instance includes chain A (ACE2) and chain E (RBD), and the second instance includes chain B (ACE2) and chain F (RBD). The overall interactive energies between the RBD and ACE2 are shown in the first two rows (green). Remaining rows show interaction energies for regions of interest: the loop site (yellow), hotspot 31 (red), and hotspot 353 (gray). Total energy is computed as the sum of electrostatic interactions and van der Waals (vdW) forces.Furthermore, all three regions of interest have a lower total energy in SARS-CoV-2 than in SARS-CoV, with hotspot 31 (red) having the greatest negative contribution. We now have quantitative evidence that the conformational changes in the three sites do indeed increase the binding affinity between the spike protein and ACE2.Nevertheless, we should be careful with making inferences about the infectiousness of SARS-CoV-2 based on these results. To add evidence for our case, we would need biologists to perform additional experiments.Another reason for our cautiousness is that proteins are not fixed objects but rather dynamic structures whose shape is subject to small changes over time. We will now transition from the static study of proteins to the field of molecular dynamics, in which we simulate the movement of proteins’ atoms, along with their interactions as they move.Next lesson            Shang, J., Ye, G., Shi, K., Wan, Y., Luo, C., Aijara, H., Geng, Q., Auerbach, A., Li, F. 2020. Structural basis of receptor recognition by SARS-CoV-2. Nature 581, 221–224. https://doi.org/10.1038/s41586-020-2179-y &#8617; &#8617;2 &#8617;3 &#8617;4 &#8617;5              Li, F. 2008.Structural analysis of major species barriers between humans and palm civets for severe acute respiratory syndrome coronavirus infections. J. Virol. 82, 6984–6991. &#8617;              Wu, K., Peng, G., Wilken, M., Geraghty, R. J. &amp; Li, F. 2012. 
Mechanisms of host receptor adaptation by severe acute respiratory syndrome coronavirus. J. Biol. Chem. 287, 8904–8911. &#8617;      "
     
   } ,
  
   {
     
        "title"    : "Protein Structure Prediction is Difficult",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/structure_intro",
        "date"     : "",
        "content"  : "Laboratory methods for determining protein structureAlthough we would like to infer nature’s magic algorithm for inferring protein structure from amino acid sequence, biochemists can determine the structure of a protein experimentally. We will introduce two popular and sophisticated laboratory methods for accurately determining protein structure. We appeal to high-quality videos explaining them if you are interested.In X-ray crystallography, researchers crystallize many copies of a protein and then shine an intense beam of X-rays at the crystal. The light hitting the protein is diffracted, creating patterns from which the position of every atom in the protein can be inferred. If you are interested in learning more about X-ray crystallography, check out the following excellent two-part video series from The Royal Institution.            X-ray crystallography is over a century old and has been the de facto approach for protein structure determination for decades. Yet a newer method is now rapidly replacing X-ray crystallography.In cryo-electron microscopy (cryo-EM), researchers preserve thousands of copies of a protein in non-crystalline ice and then examine these copies with an electron microscope. Check out the following YouTube video from the University of California San Francisco for a detailed discussion of cryo-EM.      Unfortunately, laboratory approaches for structure determination are expensive and cannot be used on all proteins. An X-ray crystallography experiment for a single protein costs upward of $2,000, and building an electron microscope can cost millions. When applying X-ray crystallography, crystallizing a protein is a challenging task, and each copy of the protein must line up in the same way, which does not work for very flexible proteins. And to study bacterial proteins, we need to culture the bacteria in the lab, but microbiologists have estimated that fewer than 2% of bacteria can be cultured with current approaches.1Protein structures that have been determined experimentally are typically stored in the PDB, which contains over 160,000 protein structures. This number may seem large, but a recent study estimated that the 20,000 human genes translate into between 620,000 and 6.13 million protein isoforms (i.e., protein variants with slightly different structures).2 If we hope to catalog the proteins of all living things, then our work on structure determination is just beginning.Protein sequence and structure do not correlate wellThe prediction of protein structure from amino acid sequence is challenging because this prediction is fine-tuned with respect to some mutations but robust with respect to others. On the one hand, small perturbations in a protein’s sequence can drastically change the protein’s shape and even render it useless. On the other hand, different amino acids can have similar chemical properties, and so some sequence mutations will hardly change the structure of the protein. As a result, two very different amino acid sequences can fold into proteins having similar structure and comparable function.Continuing with our hemoglobin example, the following figure compares the sequences and structures of hemoglobin subunit alpha taken from three species: humans (PDB: 1si4), shortfin mako sharks (PDB: 3mkb), and emus (PDB: 3wtg). 
Hemoglobin is the oxygen-transport protein in the blood, consisting of two alpha “subunit” proteins and two beta subunit proteins that combine into a protein complex; because hemoglobin is well-studied and much shorter than the SARS-CoV-2 spike protein (the alpha and beta subunits are only 140 and 147 amino acids long, respectively), we will use it as an example throughout this module. The alpha subunits for the three species are markedly different in terms of amino acid sequence, and yet their 3-D structures are essentially identical.(Top) An amino acid sequence comparison of the first 40 (out of 140) amino acids of hemoglobin subunit alpha for three species: human, shortfin mako shark, and emu. A column is colored blue if all three species have the same amino acid, white if two species have the same amino acid, and red if all amino acids are different. Sequence identity calculates the number of positions in two amino acid sequences that share the same amino acid. (Bottom) Side by side comparisons of the 3-D structures of the three proteins. The final figure on the right superimposes the first three structures to highlight that they are virtually identical.Flexible polypeptide chains can fold into many possible structuresAnother reason why protein structure prediction is so difficult is because a polypeptide is very flexible, with the ability to rotate in multiple ways at each amino acid, which means that the polypeptide can fold into a staggering number of different shapes. This polypeptide flexibility  owes to the molecular structure of amino acids.As shown in the figure below, an amino acid comprises four parts. In the center, a carbon atom (called the alpha carbon) is connected to four different molecules: a hydrogen atom (H), a carboxyl group (–COOH), an amino group (-NH2), and a side chain (denoted “R” and often called an R group). The side chain is a molecule that differs between different amino acids and ranges in mass from a single hydrogen atom (glycine) up to -C8H7N (tryptophan).An amino acid consists of a central alpha carbon attached to a hydrogen atom, a side group, a carboxyl group, and an amino group.To form a polypeptide chain, consecutive amino acids are linked together during a condensation reaction in which the amino group of one amino acid is joined to the carboxyl group of another, while a water molecule (H2O) is expelled (see figure below).A condensation reaction joins two amino acids into a “dipeptide” by joining the amino group of one amino acid to the carboxyl group of the other, with a water molecule expelled. Source: https://bit.ly/3q0Ph8V.The resulting bond that is produced between the carbon atom of one amino acid’s carboxyl group and the nitrogen atom of the next amino acid’s amino group, called a peptide bond, is very strong. The peptide has very little rotation around this bond, which is almost always locked at 180°. As peptide bonds are formed between adjacent amino acids, the polypeptide chain takes shape, as shown in the figure below.A polypeptide chain formed of three amino acids.However, the bonds within an amino acid, joining the alpha carbon to its carboxyl group and amino group, are not as rigid, and the polypeptide is free to rotate around these two bonds. 
This rotation produces two angles of interest, called the phi angle (φ) and psi angle (ψ) (see figure below), which are formed at the alpha carbon’s connections to its amino group and carboxyl group, respectively.A polypeptide chain of multiple amino acids with the torsion angles φ and ψ indicated. The angle ω indicates the angle of the peptide bond, which is typically 180°. Image courtesy: Adam Rędzikowski.Below is an excellent video from Jacob Elmer illustrating how changing φ and ψ at a single amino acid can drastically reorient a protein’s shape.      A good analogy for polypeptide flexibility is the “Rubik’s Twist” puzzle, shown in the video below, which consists of a linear chain of flexible blocks that can form many different shapes.      A polypeptide with n amino acids has n - 1 peptide bonds, meaning that its shape is influenced by n - 1 phi angles and n - 1 psi angles. If each bond angle has k possible values, then the polypeptide has k2n-2 total possible conformations. If k is equal to 3 and n is equal to only 100 (representing a short polypeptide), then the number of potential protein structures is more than the number of atoms in the universe! The ability of the magic algorithm to reliably find a single conformation despite such an enormous number of potential shapes is called Levinthal’s paradox.3Although protein structure prediction is difficult, it is not impossible; the magic algorithm is not, after all, magic. But before discussing how we can solve this problem, we will need to learn a few more biochemical details and be more precise about two things. First, we should specify what we mean by the “structure” of a protein. Second, although we know that a polypeptide always folds into the same final three-dimensional shape, we have not said anything about why a protein folds in a certain way. We will therefore need a better understanding of how the physicochemical properties of amino acids influence a protein’s final structure.Next lesson            Wade W. 2002. Unculturable bacteria–the uncharacterized organisms that cause oral infections. Journal of the Royal Society of Medicine, 95(2), 81–83. https://doi.org/10.1258/jrsm.95.2.81 &#8617;              Ponomarenko, E. A., Poverennaya, E. V., Ilgisonis, E. V., Pyatnitskiy, M. A., Kopylov, A. T., Zgoda, V. G., Lisitsa, A. V., &amp; Archakov, A. I. 2016. The Size of the Human Proteome: The Width and Depth. International journal of analytical chemistry, 2016, 7436849. https://doi.org/10.1155/2016/7436849 &#8617;              Levinthal, C. 1969. How to Fold Graciously. Mossbaur Spectroscopy in Biological Systems, Proceedings of a meeting held at Allerton House, Monticello, Illinois. eds. Debrunner, P., Tsibris, J.C.M., Munck, E. University of Illinois Press Pages 22-24. &#8617;      "
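To get a feel for the numbers behind Levinthal's paradox, here is a quick Python sanity check (k = 3 and n = 100 are the illustrative values from the text; 10^82 is a common rough estimate of the number of atoms in the observable universe):

```python
# Back-of-the-envelope count of polypeptide conformations.
k = 3    # assumed number of allowed values per torsion angle
n = 100  # number of amino acids in a short polypeptide

conformations = k ** (2 * n - 2)  # n - 1 phi angles and n - 1 psi angles
atoms_in_universe = 10 ** 82      # rough common estimate

print(len(str(conformations)))            # 3**198 has ~95 digits
print(conformations > atoms_in_universe)  # True
```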
     
   } ,
  
   {
     
        "title"    : "Classifying White Blood Cell Images",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/training",
        "date"     : "",
        "content"  : "Cross validationWe are nearly ready to apply k-NN to a dimension-reduced shape space of WBC nuclear images. However, we already know the correct class of every image in our dataset.One approach for assessing the performance of a classification algorithm when the class of each object is known is to exclude some subset of the data, called the validation set. After hiding the correct classes for elements of the validation set from the classification algorithm, we will measure how often the algorithm correctly identifies the class of each object in the validation set.STOP: What issues do you see with using a validation set?Yet it remains unclear which subset of the data we should use as a validation set. Random variation could cause the classifier’s accuracy to change depending on which subset we choose. Ideally, we would use a more democratic approach that is not subject to random variation and that uses all of the data for validation.In cross validation, we divide our data into a collection of f (approximately) equally sized groups called folds. We use one of these folds as a validation set, keeping track of which objects the classification algorithm classifies correctly, and then we start over with a different fold as our validation set. In this way, every element in our dataset will get used as a member of a validation set exactly once.A first attempt at quantifying the success of a classifierBefore we can apply cross validation to WBC images, we should discuss how to quantify the performance of the classifier. The table below shows the result of applying k-NN to the iris flower dataset, using k equal to 3 and cross validation with f equal to 10 (since there are 150 flowers, each fold contains 15 flowers). This table is called a confusion matrix, because it helps us visualize whether we are “confusing” the class assignment of an object.In the confusion matrix, rows correspond to true classes, and columns correspond to predicted classes. For example, consider the second row, which corresponds to the flowers that we know are Iris versicolor. k-NN predicted that none of these flowers were Iris setosa, that 47 of these flowers were Iris versicolor, and that three of these flowers were Iris virginica. Therefore, it correctly predicted the class of 47 of the 50 total Iris versicolor flowers.            Iris setosa      Iris versicolor      Iris virginica                  50      0      0              0      47      3              0      4      46      Note: We did not apply dimension reduction to the iris flower dataset because it has only four dimensions.We define the accuracy of a classifier as the fraction of objects that it correctly identifies out of the total. For the above iris flower example, the confusion matrix indicates that k-NN has an accuracy of (50 + 47 + 46)/150 = 95.3%.It may seem that accuracy is the only metric that we need. But if we were in a smarmy mood, then we might design a classifier that produces the following confusion matrix for our WBC dataset.            Granulocyte      Monocyte      Lymphocyte                  291      0      0              21      0      0              33      0      0      STOP: What is the accuracy of this classifier?Our clown classifier blindly assigned every image in the dataset to be a granulocyte, but its accuracy is 291/345 = 84.3%! To make matters worse, below is a confusion matrix for a hypothetical classifier on the same dataset that is clearly better but that has an accuracy of only (232 + 17 + 26)/345 = 79.7%.   
         Granulocyte      Monocyte      Lymphocyte                  232      25      34              2      17      2              6      1      26      The failure of this classifier to attain the same accuracy as the one assigning the majority class to each element owes to the WBC dataset having imbalanced classes, which means that reporting only a classifier’s accuracy may be misleading.For another example, say that we design a sham medical test for some condition that always comes back negative. If 1% of the population at a given point in time is positive for the condition, then we could report that our test is 99% accurate. But we would fail to get this test approved for widespread use because it never correctly identifies an individual who has the condition.STOP: What other metrics could we design to measure a classifier’s success?Recall, specificity, and precisionTo motivate our discussion of other classifier metrics, we will continue with the analogy of medical tests, which can be thought of as classifiers with two classes (positive or negative).First, we define some terms. A true positive is a positive test in a patient that has the condition; a false positive is a positive test in a patient that does not have the condition; a true negative is a negative test in a patient that does not have the condition; and a false negative is a negative test in a patient that does have the condition. The table below shows the locations of these four terms in the two-class confusion matrix for the test.The locations of true positives, false positives, true negatives, and false negatives in the confusion matrix associated with a medical test. Correct predictions are shown in green, and incorrect predictions are shown in red.In what follows, we will work with the confusion matrix for a hypothetical medical test shown in the figure below.A hypothetical medical test confusion matrix.STOP: What is the accuracy of this test? How does it compare to the accuracy of a test that returns negative for everyone in the population?Once again, this test has lower accuracy than one that returns negative for all individuals, but we will now show metrics for which it is superior.The recall (a.k.a. sensitivity) of a two-class classifier is the percentage of positive cases that the test correctly identifies, or the ratio of true positives over the sum of the true positives and false negatives (found by summing the top row of the confusion matrix). For the confusion matrix in the table above, the recall is 1,000/(1,000 + 500) = 66.7%. Recall ranges from 0 to 1, with larger values indicating that the test is “sensitive”, meaning that it can identify true positives from a pool of patients who actually are positive.The specificity of a test is an analogous metric for patients whose actual status is negative. Specificity measures the ratio of true negatives to the sum of true negatives and false positives (found by summing the second row of the confusion matrix). The hypothetical medical test has specificity equal to 198,000/(198,000 + 2,000) = 99%.Finally, the precision of a test is the percentage of positive tests that are correct, or the ratio of true positives to the total number of positive tests (found by summing the first column of the confusion matrix). 
For example, the precision of our hypothetical medical test is 1,000/(1,000 + 2,000) = 33.3%.STOP: How could we trick a test to have recall, specificity, or precision close to 1?Just like accuracy, each of the above three metrics is imperfect on its own and can be fooled by a frivolous test that always returns positive or negative. However, a frivolous test cannot score well on all these metrics at the same time. Therefore, in practice we will examine all these metrics, as well as accuracy, when assessing the quality of a classifier.STOP: Consider a dataset of 201,500 patients, 1,500 of whom have a condition. Compute the recall, specificity, and precision of a medical test that always returns negative. How do these metrics compare against those of our hypothetical test?You may find all these terms difficult to keep straight. You are not alone! An entire generation of scientists makes copious trips to the Wikipedia page describing these and other classification metrics. After all, it’s called a confusion matrix for a reason…            Extending classification metrics to multiple classesBefore we return to our example of classifying images of WBC nuclei, we need to extend the ideas discussed in the preceding section to handle more than two classes. To do so, we consider each class individually and treat this class as the “positive” case and all other classes together as the “negative” case.We use the iris flower dataset to show how this works. Say that we wish to compute the recall, specificity, and precision for Iris virginica using the k-NN confusion matrix that we generated, reproduced below.            Iris setosa      Iris versicolor      Iris virginica                  50      0      0              0      47      3              0      4      46      We can simplify this confusion matrix into a two-class confusion matrix that combines the two classes corresponding to the other two species. The figure below shows this smaller confusion matrix, with Iris virginica moved to the first row and column.            Iris virginica      Iris setosa and Iris versicolor                  46      4              3      97      This simplification allows us to compute classification statistics with respect to Iris virginica.  recall: 46/(46+4) = 92%  specificity: 97/(3+97) = 94%  precision: 46/(46+3) = 93.9%Once we have computed a statistic for each of the three iris species, we can then obtain a statistic for the classifier as a whole by taking the average of the statistics over all three species. For example, the overall recall of the classifier shown in the k-NN iris flower confusion matrix is the average of the recall for Iris setosa, Iris versicolor, and Iris virginica (which we just computed to be 92%). We leave the computation of the other two recall values as an exercise.STOP: Compute the recall, specificity, and precision for each of Iris setosa and Iris versicolor using the k-NN confusion matrix. Then, average each statistic over all three species to determine the classifier’s overall recall, specificity, and precision.Let us consider one more example, returning to the confusion matrix for a hypothetical classifier on our WBC image dataset, reproduced below.            Granulocyte      Monocyte      Lymphocyte                  232      25      34              2      17      2              6      1      26      The table below shows the computation of average recall, specificity, and precision for this classifier, which we previously mentioned has an accuracy of 79.7%.                   
Count      Recall      Specificity      Precision                  Granulocyte      291      79.725%      85.185%      96.667%              Monocyte      21      80.952%      91.975%      39.535%              Lymphocyte      33      78.788%      88.462%      41.935%              Average             79.822%      88.541%      59.379%              Weighted Average             79.710%      85.912%      87.954%      Note that the precision statistics for monocytes and lymphocytes weigh down the overall average precision of 59.379%. As a result, when we have imbalanced classes, we will also report the weighted average of the statistic, weighted over the number of elements in each class (see the final row in the above table). For example, the weighted average of the precision statistic is (291 · 96.667% + 21 · 39.535% + 33 · 41.935%)/(291 + 21 + 33) = 87.954%.Now that we understand more about how to quantify the performance of a classifier, we are ready to apply k-NN to our WBC shape space (post-PCA of course!) and then assess its performance.Visit tutorialApplying a classifier to the WBC shape spaceThe confusion matrix shown below is the result of running k-NN on our WBC nuclear image shape space, using d (the number of dimensions in the PCA hyperplane) equal to 10, k (the number of nearest neighbors to consider when assigning a class) equal to 1, and f (the number of folds in cross validation) equal to 10.            Granulocyte      Monocyte      Lymphocyte                  259      9      23              14      6      1              5      2      26      For these parameters, k-NN has an accuracy of 84.3% and a weighted average of recall, specificity, and precision of 84.3%, 69.4%, and 85.7%, respectively, as shown in the following table.                   Count      Recall      Specificity      Precision                  Granulocyte      291      89.003%      64.815%      93.165%              Monocyte      21      28.571%      96.605%      35.294%              Lymphocyte      33      78.788%      92.308%      52.000%              Average             65.454%      84.576%      60.153%              Weighted Average             84.348%      69.380%      85.705%      If you explored the preceding tutorial, then you may wish to verify that these three values of d, k, and f appear to be close to optimal, in that changing them does not improve our classification metrics. We should ask why this is the case.We start with d. If we set d too large, then once again the curse of dimension strikes. Using d equal to 344 (with k equal to 1 and f equal to 10) produces the baffling confusion matrix below, in which every element in the space is somehow closest to a lymphocyte.            Granulocyte      Monocyte      Lymphocyte                  0      0      291              0      0      21              0      0      33      Using d = 3, we obtain better results, but we have reduced the dimension so much that we start to lose the signal in the data.            Granulocyte      Monocyte      Lymphocyte                  257      15      19              16      5      0              20      0      13      We next consider k. It might seem that taking more neighbors into account would be helpful, but because of the class imbalance toward granulocytes, the effects of random noise mean that as we increase k, we will start considering granulocytes that just happen to be relatively nearby. 
For example, when k is equal to 5, every monocyte is classified as a granulocyte, as shown in the confusion matrix below (with d equal to 10 and f equal to 10).            Granulocyte      Monocyte      Lymphocyte                  264      1      26              21      0      0              7      0      26      The question of the number of folds, f, is trickier. Increasing this parameter does not change the confusion matrix much, but in general, if we use too few folds (i.e., if f is too small), then we ignore too many known objects’ classes.Yet we still have a problem. Although k-NN can identify granulocytes and lymphocytes quite well, it performs poorly on monocytes because of the class imbalance in our dataset. We have so few monocytes that encountering another one in the shape space simply does not happen often.Statisticians have devised a variety of approaches to address class imbalance. We could undersample our data by excluding a random sample of the granulocytes. Undersampling works better when we have a large amount of data, so that throwing out some of the data does not cause problems. In our case, our dataset is small to begin with, and undersampling would risk a steep drop in the classifier’s performance on granulocytes.We could also try using a different classification algorithm. One idea is to use a cost-sensitive classifier that charges a variable penalty for assigning an element to the wrong class, and then minimizes the total cost over all elements. For example, classifying a monocyte as a granulocyte would receive a greater penalty than classifying a granulocyte as a monocyte. A cost-sensitive classifier would help increase the number of images that are classified as monocytes, although it would also increase the number of images incorrectly classified as monocytes.Yet ultimately, k-NN outperforms much more advanced classifiers on this dataset. It may be a relatively simple approach, but k-NN offers a great match for classifying images within a WBC shape space, since proximity in this space indicates that two WBCs belong to the same family.Limitations of our WBC image classification pipelineEven though k-NN performed reasonably well on our WBC image dataset, we can still make improvements. After all, our model requires a number of steps from the intake of data to their ultimate classification, which means that several potential failure points could arise.We will start with data. If we have great data, then a relatively simple approach will probably give a great result, but if we have bad data, then no amount of algorithmic wizardry will save us. An example of “good data” is the iris flower dataset; the features chosen were measured precisely and differentiate the flowers so clearly that it almost seems silly to run a classifier.In our case, we have a small collection of very low resolution WBC images, which limits the performance of any classifier before we begin. Yet these data limitations are a feature of this chapter, not a bug, as they allow us to highlight a very common issue in data analysis. Now that we have built a classification pipeline, we should look for a larger dataset with higher-resolution images and less class imbalance.The next failure point in our model is our segmentation pipeline. Earlier in the module, we saw that this pipeline did not perfectly segment the nucleus from every image, sometimes capturing only a fragment of the nucleus. 
Perhaps we could exclude an image from downstream analysis if the segmented nucleus is below some threshold size.We then handed off the segmented images to CellOrganizer to build a shape space from the vectorized boundaries of the nuclei. Even though CellOrganizer does what we tell it to do, the low resolution of the nuclear images will mean that the vectorization of each nuclear image is noisy.But even if we use higher resolution images and adjust our segmentation pipeline, we are still only building a model from the shape of the nucleus. We didn’t even take the size of the nucleus into account! If we return to the three sample WBC images from the introduction, reproduced in the figure below, then we can see that the lymphocyte nucleus is much smaller than the other two nuclei, which is true in general. When vectorizing the images, we could reserve an additional coordinate of each image’s vector for the size (in pixels) of the segmented nucleus. This change would hopefully help improve the performance of our classifier, especially on lymphocytes.                                                                                                        Three images from the blood cell image dataset showing a granulocyte (left), a monocyte (center), and a lymphocyte (right).  STOP: What other quantitative features could we extract from our images?Finally, we discuss the classification algorithm itself. We used k-NN because it is intuitive to newcomers, but perhaps a more complicated algorithm could peer deeper into our dataset to find more subtle signals.Ultimately, obtaining even moderate classification performance is impressive given the quality and size of our dataset, and the fact that we only modeled the shape of each cell’s nucleus. It also makes us wonder if we could improve this performance if we had access to a very high-quality dataset or a higher-powered computational approach. In this chapter’s conclusion, we discuss the foundations of an approach that not only constitutes the best known solution for WBC image classification but that is taking over the world.Next lesson"
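Since all of these statistics come straight from the confusion matrix, they are easy to verify programmatically. Here is a minimal Python sketch (using NumPy; the variable names are ours) that reproduces the accuracy, per-class statistics, and averages for the hypothetical WBC classifier discussed above:

```python
import numpy as np

# Rows = true classes, columns = predicted classes
# (granulocyte, monocyte, lymphocyte), from the hypothetical WBC example.
cm = np.array([[232, 25, 34],
               [  2, 17,  2],
               [  6,  1, 26]])

total = cm.sum()
accuracy = np.trace(cm) / total  # (232 + 17 + 26)/345 ~ 0.797
print(round(accuracy, 4))

counts = cm.sum(axis=1)           # true class sizes: 291, 21, 33
tp = np.diag(cm).astype(float)    # correctly classified members of each class
fn = counts - tp                  # members of the class predicted as something else
fp = cm.sum(axis=0) - tp          # other classes predicted as this class
tn = total - tp - fn - fp

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

# Unweighted and class-size-weighted averages, matching the table above.
for name, stat in [("recall", recall), ("specificity", specificity),
                   ("precision", precision)]:
    print(name, stat.round(5),
          round(stat.mean(), 5),
          round(np.average(stat, weights=counts), 5))
```

Running this sketch reproduces, for example, a granulocyte recall of 79.725% and a weighted average precision of 87.954%, as reported in the table.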
     
   } ,
  
   {
     
        "title"    : "Transcription and DNA-Protein Binding",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/transcription",
        "date"     : "",
        "content"  : "The central dogma of molecular biologyDNA is a double-stranded molecule consisting of the four nucleobases adenine, cytosine, guanine, and thymine; the sum total of a cell’s DNA constitutes its genome. A gene is a region of an organism’s DNA that is transcribed into a single-stranded RNA molecule in which thymine is converted to uracil and the other bases remain the same.The RNA transcript is then translated into the amino acid sequence of a protein. Because there are four different nucleobases but twenty amino acids, RNA is translated in codons, or triplets of nucleobases, according to a mapping called the genetic code (see figure below).The genetic code, which dictates the conversion of RNA codons into amino acids. Codons are read from the inside of the figure outward. Image courtesy J_Alves, Open Clip Art.DNA can therefore be thought of as a blueprint for storing information that flows from DNA to RNA to protein. This flow of information is called the central dogma of molecular biology, illustrated in the figure below.Note: Like any dogma, there are times in which the central dogma of molecular biology is violated. If you are interested in an example, consider Chapter 4 of Bioinformatics Algorithms.The central dogma of molecular biology states that genetic information flows from DNA in the nucleus, into the RNA that is transcribed from DNA, and then into proteins that are translated from RNA and that then serve some purpose in the cell.Transcription factors control gene regulationAll your cells have essentially the same DNA, and yet your liver cells, heart cells, and brain cells serve different functions. This is because the rates at which your genes are regulated, or converted into RNA and then protein, vary for different cell types and in response to different stimuli.Gene regulation typically occurs at either the DNA or protein level. At the DNA level, regulation is modulated by transcription factors, master regulator proteins that typically bind to the DNA immediately preceding a gene and serve to either activate or repress the gene’s rate of transcription, turning that rate up or down, respectively.Because of the central dogma, transcription factors are involved in a feedback loop. DNA is transcribed into RNA, which is translated into the protein sequence of a transcription factor, which then binds to the upstream region of a gene and changes its rate of transcription.Transcription factors are vital for the cell’s response to its environment because extracellular stimuli can activate a transcription factor via a system of signaling molecules that convey a signal through relay molecules to the transcription factor (see figure below). Only when the transcription factor is activated will it regulate its target gene(s).A cell receiving a signal which triggers a response in which this signal is “transduced” into the cell, resulting in transcription of a gene. We will discuss signal transduction in greater detail in a future module.1In module 2, we will discuss the details of how the cell detects an extracellular signal and conveys it as a response within the cell. For now, we will focus on the relationship between transcription factors and the genes that they regulate.Determining if a transcription factor regulates a given geneA transcription factor has a weak binding affinity to DNA in general, but it has a very strong binding affinity for a single specific short sequence of nucleotides2 called a sequence motif. 
Think of a transcription factor as latching onto DNA and then sliding up and down the DNA molecule until it finds its target motif, where it clamps down. If this motif occurs immediately before a gene, then the transcription factor will regulate this gene.Note: The astute reader will notice that we have already used the term “motif” in two different contexts, first to mean a recurring network substructure and now to mean a sequence of nucleotides to which a transcription factor binds. This sequence is called a “motif” because the transcription factor may regulate multiple genes, so that the binding sequence will occur immediately before most or all of these genes.A natural question, then, is how to find the set of genes to which a transcription factor binds. A common experiment answering this question is called ChIP-seq3, which is short for chromatin immunoprecipitation sequencing. This approach, which is illustrated in the figure below, combines an organism’s DNA with multiple copies of a protein of interest that binds to DNA (which in this case would be a transcription factor). After allowing the proteins to bind naturally to the DNA, the DNA is cleaved into much smaller fragments of a few hundred base pairs. As a result, we obtain a collection of DNA fragments, some of which are attached to a copy of our protein of interest.The question is how to isolate the fragments of DNA that are bound to a transcription factor of interest, and the clever trick is to use an antibody. Normally, antibodies are produced by the immune system to target foreign pathogens. The antibody used by ChIP-seq is designed to bind to our protein of interest, and the antibody is attached to a bead. Once the antibody attaches to the protein target, a complex is formed consisting of the DNA fragment, the protein bound to the DNA, the antibody bound to the protein, and the bead attached to the antibody. Because the bead weighs down these complexes, they can be filtered out as precipitate from the solution, and we are left with just the DNA fragments that are bound to our protein.In a final step, the protein is unlinked from the DNA, leaving a collection of DNA fragments that were previously bound to a single protein. Each fragment is read using DNA sequencing to determine its order of nucleotides, which is then queried against the genome to determine the gene(s) that the fragment precedes. When the protein is a transcription factor, we can therefore hypothesize that these are the genes that the transcription factor regulates!An overview of ChIP-seq. Figure courtesy Jkwchui, Wikimedia Commons user.If you would like a different explanation, you may also like to check out the following excellent video on identifying genes regulated by a transcription factor. This video was produced by students in the 2020 PreCollege Program in Computational Biology at Carnegie Mellon. The presenters won an award from their peers for their work, and for good reason!      STOP: How do you think that researchers could determine whether a transcription factor activates or represses a given gene?As a result of techniques like ChIP-seq, researchers have learned a great deal about which transcription factors regulate which genes. 
The key is to organize the relationships between transcription factors and the genes that they regulate in a way that will help us identify patterns in these relationships. References: [1] CC https://www.open.edu/openlearn/science-maths-technology/general-principles-cellular-communication/content-section-1 [2] Goodsell, David (2009), The Machinery of Life. Copernicus Books. [3] Johnson, D. S., Mortazavi, A., Myers, R. M., &amp; Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316(5830), 1497–1502. https://doi.org/10.1126/science.1141319 "
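To make the codon-to-amino-acid mapping described above concrete, here is a minimal Python sketch of translation. This is our own illustration, not part of the lesson: the dictionary covers only a handful of codons (the full genetic code has 64 entries), and the example RNA string is hypothetical.

```python
# A tiny illustrative slice of the genetic code (RNA codon -> amino acid).
# The real table has 64 codons; this subset is for demonstration only.
GENETIC_CODE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe", "GGC": "Gly", "GAA": "Glu",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate(rna):
    """Translate an RNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = GENETIC_CODE.get(rna[i:i+3], "?")
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return "-".join(protein)

print(translate("AUGUUUGGCGAAUAA"))  # Met-Phe-Gly-Glu
```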
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Adding Directionality to Spike Protein GNM Simulations Using ANM",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_ANM",
        "date"     : "",
        "content"  : "In this tutorial, we will use Normal Mode Wizard (NMWiz), a VMD plugin that serves as a GUI for ProDy, to perform ANM analysis on the SARS-CoV-2 RBD. We will visualize the results in a cross-correlation map and square fluctuation plot and then produce ANM animations showing the predicted range of motion of the SARS-CoV-2 spike RBD.Before starting, make sure that you have installed VMD and know how to load molecules into the program. If you need a refresher, visit the Multiseq tutorial.First, load the SARS-CoV-2 spike protein/ACE2 enzyme complex (6vw1) into VMD. Then, start up NMWiz by clicking Extensions &gt; Analysis &gt; Normal Mode Wizard.A small window will open. Select ProDy Interface.We want to focus on the RBD of SARS-CoV-2, so we need to choose a new selection. In ProDy Interface, change Selection to protein and chain F and click Select. Next, make sure that ANM calculation is selected for ProDy job:. Check the box for write and load cross-correlations heatmap. Finally, click Submit Job.Note: Let the program run and do not click any of the VMD windows, as this may cause the program to crash or become unresponsive. The job can take between a few seconds and a few minutes. When the job is complete, you will see a new window NMWiz - 6vw1_anm ... and the cross=correlation heatmap appear.Now that the ANM calculations are completed, you will see the visualization displayed in VMD Main. Disable the original visualization of 6vw1 by double-clicking on the letter D. The letter will change red to indicate that it has been disabled.In OpenGL Display, you can see the protein with numerous arrows representing the calculated fluctuations.To visualize the protein movements as described by the arrows, we need to create the animation. Return to the NMWiz - 6vw1_anm... window and click Make next to Animations.VMD Main should now show a new row for the animation.The animation should also be visible in OpenGL Display. However, the previous visualizations are somewhat in the way. We can disable them in the same way as before by double-clicking the letter D.You should now be able to see the animation of the ANM fluctuations of 6vw1, as shown in the figure below.  We now will return to the main text and interpret our results.Return to main text"
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Integrating Molecular Dynamics Analyses with DynOmics",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_DynOmics",
        "date"     : "",
        "content"  : "In this tutorial, we will be using a publicly available web server, DynOmics, produced by Dr. Hongchun Li and colleagues at the University of Pittsburgh School of Medicine. This server is dedicated to performing molecular dynamics analysis by integrating the GNM and ANM models that we learned about in the main text.Navigate to the DynOmics homepage. This page contains many options that we can change to customize our analysis, but we will keep the default options for now.To choose our target molecule, we need to input the PDB ID. Since we will be performing the analysis on the SARS-CoV-2 spike protein, enter 6vxx under PDB ID. Then, click Submit.Once the analysis is complete, you will see all the ANM and GNM results listed next to an interactive visualization of the protein. In addition, the visualization is colored based on the predicted protein flexibility from the slow mode analysis.Let’s start exploring some of the results by clicking Molecular Motions - Animations. This shows an interactive animation of the protein with the same coloring as before and showing the predicted motion of protein fluctuation based on ANM calculations. On the right, we can customize the animation by changing the vibrations and vectors to make the motions more pronounced.We can also change the Mode index. We learned in the main text that the motion of protein fluctuations can be broken down into a collection of individual modes. By changing the Mode index, we can see the different contribution of each mode to the motion. The lower the index of the mode, the greater this mode contributes to the square fluctuations of the protein’s residues.We can also download the calculations as a .nmd file and visualize it in VMD. If you are interested in using VMD, open the software and go to Extensions &gt; Analysis &gt; Normal Mode Wizard. Then, click Load NMD File and select the .nmd file that you downloaded. Now that the ANM calculation is loaded into VMD, you can customize the visualization and recreate the animation by clicking Animation: Play.Next, return to DynOmics and click Mean-Square Fluctuations of Residues. On this page, you will see two visualizations of the protein, labeled Theoretical B-Factors and Experimental B-Factors as well as the B-factor plot. Recall that theoretical B-factors are calculated during GNM analysis, whereas experimental B-factors are included in the PDB. On the bottom, we can see the plot of the B-factors across the entire protein split into chains.The next result page we will visit is Selected Modes - Color-coded Diagrams. On this page, we can see the shape of each individual mode, or an average of the “slowest” two, three, or ten modes. As we saw earlier in the main text, we can see a wide peak that corresponds to the RBD of the spike protein. Clicking the plot highlights the residue on the interactive visualizations.Next, click Cross-correlations between Residue Fluctuations, which shows the full cross-correlation heat map that we produced in the GNM tutorial.Click Inter-residue Contact Map, which shows a visualization of the connected alpha carbon network based on the threshold distance, along with a contact map (called a “connectivity map”). 
The default threshold distance is set to 7.3 angstroms; to change the threshold, we need to perform the calculations again after changing the cutoff distance in Advanced options. There is plenty more to say about the additional results produced by DynOmics; if you are interested in these results, please check out the DynOmics tutorial. If you have made it this far, congratulations! You have become an expert in protein analysis. We will now head back to the main text to wrap up this module with some concluding thoughts. Return to main text"
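If you downloaded the .nmd file from DynOmics, you can also inspect it programmatically with ProDy rather than (or in addition to) NMWiz. A minimal sketch of our own, assuming the file was saved as 6vxx.nmd in the working directory and that ProDy and Matplotlib are installed:

```python
from prody import parseNMD, calcSqFlucts
import matplotlib.pyplot as plt

# parseNMD returns the normal mode data and the associated atoms.
modes, atoms = parseNMD('6vxx.nmd')

# Plot the square fluctuations implied by the loaded modes;
# peaks correspond to the most mobile regions (e.g., the RBD).
plt.plot(calcSqFlucts(modes))
plt.xlabel('residue index')
plt.ylabel('square fluctuations')
plt.show()
```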
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Analyzing Coronavirus Spike Proteins Using GNM",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_GNM",
        "date"     : "",
        "content"  : "In this tutorial, we will be performing GNM calculations on one of the chains in the SARS-CoV-2 spike protein (6vxx) and then visualizing the results using the plots that we discussed in the main text.We will be using ProDy, an open-source Python package that allows users to perform protein structural dynamics analysis. Its flexibility allows users to select specific parts or atoms of the structure for conducting normal mode analysis and structure comparison.First, please install the following software:  Python (2.7, 3.5, or later)  ProDy  NumPy  Biopython  IPython  MatplotlibWe recommend creating a workspace for storing files when using ProDy or storing protein .pdb files. Open a terminal window and navigate to this workspace before starting IPython with the following command.ipython --pylabYou will now need to import functions that we will use in this tutorial.In[#]: from pylab import *In[#]: from prody import *In[#]: ion()Finally, we turn on interactive mode (you only need to do this once per session).In[#]: ion()Next, we will parse in 6vxx.pdb and set it as the variable spike.In[#]: spike = parsePDB('6vxx.pdb')For this GNM calculation, we will focus only on the alpha carbons of Chain A. We can access these atoms using the spike.select function as follows, storing the alpha carbons in a variable calphas.In[#]: calphas = spike.select('calpha and chain A')Recall in the main text that we converted a protein into a network of nodes and springs, where two nodes are connected by an edge if the alpha carbons corresponding to these nodes are within some threshold distance \(r_c\). We can also represent this network with a matrix called the  Kirchhoff matrix \(\Gamma\) and constructed as follows:\[\Gamma_{ij} = \begin{cases} &amp; -1 \text{ if $i \neq j$ and $R_{ij} \leq r_c$}\\ &amp;  0 \text{ if $i \neq j$ and $R_{ij} &gt; r_c$} \end{cases}\]\[\Gamma_{ii} = -\sum_j \Gamma_{ij}\]In other words, if residue i and j are connected to each other in the network, then the value of the (i, j)-th entry in the matrix will be -1. If these two residues are not connected, the the entry will be 0. The (i, j)-th entry on the main diagonal of \(\Gamma\), is equal to the total number of connections of residue i. The figure below shows an example Kirchhoff matrix for a small network.An example network (left) with its the corresponding Kirchhoff matrix (right).The Kirchhoff matrix is helpful because applying some matrix algebra to it (specifically, determining its eigenvector decomposition) allows us to estimate the inner products \(\langle \Delta R_i, \Delta R_j \rangle\) that power the GNM model.Returning to our SARS-CoV-2 example, we are now ready to build the Kirchhoff matrix. You can pass parameters for the cutoff (threshold distance between atoms) and gamma (spring constant). The defaults are 10.0 angstroms and 1.0, respectively. Here, we will set the cutoff to be 20.0 Å.In[#]: gnm = GNM('SARS-CoV-2 Spike (6vxx) Chain A Cutoff = 20.0 A')In[#]: gnm.buildKirchhoff(calphas, cutoff=20.0)For normal mode analysis with ProDy, the default is 20 non-zero modes. 
In addition, we will compute hinge sites for later use in the slow mode shape plot; these sites represent locations in the protein where fluctuations change in relative directions. In[#]: gnm.calcModes() In[#]: hinges = calcHinges(gnm) For advanced users, information about the GNM and Kirchhoff matrix can be accessed with the following commands. In[#]: gnm.getEigvals() In[#]: gnm.getEigvecs() In[#]: gnm.getCovariance() #To get information specifically on the slowest mode (which is always indexed at 0): In[#]: slowMode = gnm[0] In[#]: slowMode.getEigval() In[#]: slowMode.getEigvec() We have now successfully initialized our GNM model and are ready to generate our plots. In what follows, make sure to save your visualization (if desired) and close each plot before creating another. We will discuss how to interpret these plots back in the main text. First, we produce a contact map, which we introduced earlier in the module. In[#]: showContactMap(gnm); This command should produce the following plot. Next, we produce a cross-correlation plot with the following command. In[#]: showCrossCorr(gnm); The plot is found below. Finally, we use the following command to produce a shape plot for the slowest mode identified by GNM. In[#]: showMode(gnm[0], hinges=True) In[#]: grid(); This mode shape plot is shown in the figure below. Now that we have produced our plots, we are ready to head back to the main text and analyze our results. Return to main text"
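For convenience, here is the same GNM workflow collected into a single Python script rather than an interactive IPython session (a sketch under the same assumptions as above: 6vxx.pdb is in the working directory, and ProDy and Matplotlib are installed):

```python
from prody import parsePDB, GNM, calcHinges, showContactMap, showCrossCorr, showMode
import matplotlib.pyplot as plt

# Parse the structure and keep only the alpha carbons of chain A.
spike = parsePDB('6vxx.pdb')
calphas = spike.select('calpha and chain A')

# Build the Kirchhoff matrix with a 20.0 angstrom cutoff and compute
# the default 20 non-zero modes, plus the hinge sites of the slowest mode.
gnm = GNM('SARS-CoV-2 Spike (6vxx) Chain A Cutoff = 20.0 A')
gnm.buildKirchhoff(calphas, cutoff=20.0)
gnm.calcModes()
hinges = calcHinges(gnm)

# Contact map, cross-correlations, and slow mode shape plot.
showContactMap(gnm); plt.show()
showCrossCorr(gnm); plt.show()
showMode(gnm[0], hinges=True); plt.grid(); plt.show()
```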
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Computing the Energy Contributed by a Local Region of the SARS-CoV-2 Spike Protein Bound with the Human ACE2 Enzyme",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_NAMD",
        "date"     : "",
        "content"  : "In this tutorial, we will show how to use NAMD Energy to calculate the interaction energy for a bound complex, as well as to determine how much a given region of this complex contributes to the overall potential energy. We will use the chimeric SARS-CoV-2 RBD-ACE2 complex (PDB entry: 6vw1) and compute the interaction energy contributed by the loop site that we identified as a region of structural difference in a previous lesson.To determine the energy contributed by a region of a complex, we will need a “force field”, an energy function with a collection of parameters that determine the energy of a given structure based on the positional relationships between atoms. There are many different force fields depending on the specific type of system being studied (e.g. DNA, RNA, lipids, proteins). There are many different approaches for generating a force field; for example, Chemistry at Harvard Macromolecular Mechanics (CHARMM)1 offers a popular collection of force fields.To get started, make sure to have installed VMD and know how to load molecules into the program; if you need a refresher, visit the VMD and Multiseq Tutorial. Then, download NAMD; you may be asked to provide the path to your NAMD installation.Creating a protein structure fileNAMD needs to use the information in the force field to calculate the potential energy of a protein. To do this, it needs a protein structure file (PSF). A PSF, which is molecule-specific, contains all the information required to apply a force field to a molecular system.2 Fortunately, there are programs that can generate a PSF given a force field and a .pdb file containing a protein structure. See this NAMD tutorial for more information.First, load 6vw1 into VMD. We then need to create a protein structure file of 6vw1 to simulate the molecule. We will be using the VMD plugin Atomatic PSF Builder to create the file. From VMD Main, click Extensions &gt; Modeling &gt; Automatic PSF Builder.In the AutoPSF window, make sure that the selected molecule is 6vw1.pdb and that the output is 6vw1_autopsf. Click Load input files. In step 2, click Protein and then Guess and split chains using current selections. Then click Create chains and then Apply patches and finish PSF/PDB.Note: It may be the case that NAMD hangs when attempting to guess and split chains. If so, we are providing the PSF files here. During this process, you may see an error message stating MOLECULE DESTROYED. If you see this message, click Reset Autopsf and repeat the above steps. The selected molecule will change, so make sure that the selected molecule is 6vw1.pdb when you start over. Failed molecules remain in VMD, so deleting the failed molecule from VMD Main is recommended before each new attempt.../tutorials/NAR_compare_equal.blendIf the PSF file is successfully created, then you will see a message stating Structure complete. The VMD Main window also will have an additional line.Using NAMD Energy to compute the energy of the SARS-CoV-2 RBD loop regionNow that we have the PSF file, we can proceed to NAMD Energy. In VMD Main, click Extensions &gt; Analysis &gt; NAMD Energy. The NAMDEnergy window will pop up. First, change the molecule to be the PSF file that we created.We now want to calculate the interaction energy between the RBD and ACE2. Recall that the corresponding chain pairs are chain A (ACE2)/chain E (RBD) and chain B (ACE2)/chain F (RBD). As we did in the previous tutorial, we will use the chain B/F pair. 
Put protein and chain B and protein and chain F for Selection 1 and Selection 2, respectively. Next, we want to calculate the main protein-protein interaction energies, divided into electrostatic and van der Waals forces. Under Output File, enter your desired name for the results (e.g., SARS-2_RBD-ACE2_energies). Next, we need to give NAMDEnergy the parameter file par_all36_prot.prm. This file should be found at VMD &gt; plugins &gt; noarch &gt; tcl &gt; readcharmmpar1.3 &gt; par_all36_prot.prm. Finally, click Run NAMDEnergy. The output file will be created in your current working directory and can be opened with a simple text editor. The values of your results may vary slightly across repeated calculations. Note: You may be wondering why the interaction energy comes out to be a negative number. In physics, a negative value indicates an attractive force between two molecules, and a positive value indicates a repulsive force. We will now focus on the interaction energy between the SARS-CoV-2 RBD loop site (residues 482 to 486) and ACE2. In the NAMDEnergy window, enter protein and chain B for Selection 1 and protein and chain F and (resid 482 to 486) for Selection 2. Keep all other settings the same. You should see output results similar to the following. The above results seem to indicate that the interaction between SARS-CoV-2 RBD and ACE2 is a favorable interaction, and that the loop region contributes to this bonding. Yet our goal was to compare the total energy of the bound RBD-ACE2 complex in SARS-CoV-2 against that of SARS-CoV, as well as to compare the energy contributed by the three regions of structural difference that we identified in the main text. We will leave these comparisons to you as an exercise, and we will discuss the results in the main text. STOP: First, compute the total energy of the SARS-CoV RBD complex with ACE2 (PDB entry: 2ajf). How does it compare against the energy of the SARS-CoV-2 complex? Then, compute the energy contributed by hotspot 31 and hotspot 353 in SARS-CoV-2, as well as the energy contributed by the corresponding regions and the loop region in SARS-CoV. (Consult the table below as needed.) How do the energy contributions of corresponding regions compare? Is this surprising, and what can we conclude? Note: In the table below, “chain B” is part of the ACE2 enzyme, and “chain F” is part of the viral spike protein RBD for the virus indicated.
| Model | Region | Selection 1 | Selection 2 |
| --- | --- | --- | --- |
| SARS-CoV-2 (6vw1) | Total | protein and chain B | protein and chain F |
| SARS-CoV (2ajf) | Total | protein and chain B | protein and chain F |
| SARS-CoV-2 (6vw1) | Loop | protein and chain B | protein and chain F and (resid 482 to 486) |
| SARS-CoV (2ajf) | Loop | protein and chain B | protein and chain F and (resid 468 to 472) |
| SARS-CoV-2 (6vw1) | Hotspot31 | protein and chain B | protein and chain F and resid 455 |
| SARS-CoV-2 (6vw1) | Hotspot31 | protein and chain B and (resid 31 or resid 35) | protein and chain F |
| SARS-CoV (2ajf) | Hotspot31 | protein and chain B | protein and chain F and resid 442 |
| SARS-CoV (2ajf) | Hotspot31 | protein and chain B and (resid 31 or resid 35) | protein and chain F |
| SARS-CoV-2 (6vw1) | Hotspot353 | protein and chain B | protein and chain F and resid 501 |
| SARS-CoV-2 (6vw1) | Hotspot353 | protein and chain B and (resid 38 or resid 353) | protein and chain F |
| SARS-CoV (2ajf) | Hotspot353 | protein and chain B | protein and chain F and resid 487 |
| SARS-CoV (2ajf) | Hotspot353 | protein and chain B and (resid 38 or resid 353) | protein and chain F |

Return to main text References: [1] https://www.charmmtutorial.org/index.php/The_Energy_Function [2] https://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-unix-html/node23.html "
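If you run all twelve of the selection pairs above, keeping track of the results by hand gets tedious. The sketch below is our own illustration for collecting them programmatically, and it makes assumptions about the output format: we assume NAMDEnergy writes a whitespace-separated text file whose header row names the energy columns (e.g., Elec, VdW, Total), with one data row for a single-frame structure. Verify the column names against your own files, and note that the file names here are hypothetical.

```python
# Hypothetical helper for tabulating NAMDEnergy results. Assumes each
# output file is whitespace-separated with a header row naming columns
# such as "Elec", "VdW", and "Total"; adjust to your actual files.
def read_namd_energy(path):
    with open(path) as f:
        header = f.readline().split()
        values = f.readline().split()
    return dict(zip(header, values))

runs = [
    ("SARS-CoV-2 total", "SARS-2_RBD-ACE2_energies"),
    ("SARS-CoV-2 loop", "SARS-2_loop_energies"),
]
for label, path in runs:
    energies = read_namd_energy(path)
    print(label, energies.get("Elec"), energies.get("VdW"), energies.get("Total"))
```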
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Using ab initio Modeling to Predict the Structure of Hemoglobin Subunit Alpha",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_ab_initio",
        "date"     : "",
        "content"  : "In this software tutorial, we will use the popular ab initio modeling software called QUARK. Because of the complexity of ab initio algorithms, QUARK limits us to polypeptides with at most 200 amino acids, and so rather than determining the structure of the SARS-CoV-2 spike protein (each monomer has 1281 amino acids), we will work with hemoglobin subunit alpha (PDB entry 1si4), which is only 141 amino acids long.Before beginning, if you have not used QUARK before, then you will need to register for a QUARK account to use this software. After registering, you will receive an email containing a temporary password.Note: At the current time, QUARK only accepts registrations from university E-mail accounts. If you do not have such an account, please send an E-mail to  yangzhanglab@umich.edu and mention that you are a Biological Modeling learner with Phillip Compeau to receive access.Then, download the primary sequence of human hemoglobin subunit alpha. Visit QUARK to find the submission page for QUARK, where you should take the following steps as shown in the figure below.  Copy and paste the sequence into the first box.  Add your email address and password.  Click Run QUARK.Even though this is a short protein, it will take at least a few hours to run your submission, depending on server load. When your job has finished, you will receive an email notification and be able to download the results. In the meantime, you may like to join us back in the main text.Note: QUARK will not return a single best answer but rather the top five best-scoring structures that it finds. Continuing the exploration analogy from the text, think of these results as the five lowest points in the search space that QUARK is able to find.In the main text, we will show a figure of our models and compare them to the known structure of human hemoglobin subunit alpha from the PDB entry 1si4. You can also download our completed models if you like.Return to main text"
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Modeling bacterial adaptation to changing attractant",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_adaptation",
        "date"     : "",
        "content"  : "In this tutorial, we will extend the BioNetGen model covered in the phosphorylation tutorial to add the methylation mechanisms described in the main text to our ongoing model of bacterial chemotaxis. Our model will be based on the model by Spiro et al.1We will also add compartmentalization to our model, which will allow us to differentiate molecules that occur inside and outside of the cell.Finally, after running our model, we will see how methylation can be used to help the bacterium adapt to a relative change in attractant concentration. For reference, consult the figure below, reproduced from the main text, for an overview of the chemotaxis pathway.The chemotaxis signal-transduction pathway with methylation included. CheA phosphorylates CheB, which methylates MCPs, while CheR demethylates MCPs. Blue lines denote phosphorylation, grey lines denote dephosphorylation, and the green arrow denotes methylation. Image modified from Parkinson Lab’s illustrations.To get started, open Visual Studio Code, and click File &gt; Open Folder.... Open the EColiSimulations folder from the first tutorial. Create a copy of your file from the phosphorylation tutorial and save it as adaptation.bngl. If you would rather not follow along below, you can download a completed BioNetGen file here: adaptation.bngl.Specifying molecule typesWe first will add all molecules needed for our model. As mentioned in the main text, we will assume that an MCP can have one of three methylation states: low (A), medium (B), and high (C). We also need to include a component that will allow for the receptor to bind to CheR. As a result, we update our MCP molecule to T(l,r,Meth~A~B~C,Phos~U~P).Furthermore, we need to represent CheR and CheB; recall that CheR binds to and methylates receptor complexes, while CheB demethylates them. CheR can bind to T, so that we will need the molecule CheR(t). CheB is phosphorylated by CheY, and so it will be represented as CheB(Phos~U~P). Later we will specify reactions specifying how CheR and CheB change the methylation states of receptor complexes.begin molecule types	L(t)	T(l,r,Meth~A~B~C,Phos~U~P)	CheY(Phos~U~P)	CheZ()	CheB(Phos~U~P)	CheR(t)end molecule typesIn the observable section, we specify that we are interested in tracking the concentrations of the bound ligand, phosphorylated CheY and CheB, and the receptor at each methylation level.begin observables	Molecules bound_ligand L(t!1).T(l!1)	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules low_methyl_receptor T(Meth~A)	Molecules medium_methyl_receptor T(Meth~B)	Molecules high_methyl_receptor T(Meth~C)	Molecules phosphorylated_CheB CheB(Phos~P)end observablesDefining reactionsWe now expand our reaction rules to include methylation. First, we change the autophosphorylation rules of the receptor to have different rates depending on whether the receptor is bound and its current methylation level, which produces six rules.Note: We cannot avoid combinatorial explosion in the case of these phosphorylation reactions because they take place at different rates.) In what follows, we use experimentally verified reaction rates.#Receptor complex (specifically CheA) autophosphorylation#Rate dependent on methylation and binding states#Also on free vs. 
bound with ligand TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1 TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8 TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0 TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8 TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6 Next, we will need reactions for CheR binding to receptor complexes and methylating them. First, we consider the binding of CheR to the receptor. #CheR binding to receptor complex TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis Second, we will need multiple reaction rules for methylation of receptors by CheR because the rate of the reaction can depend on whether a ligand is already bound to the receptor as well as the current methylation level of the receptor. This gives us four rules, since a receptor at the “high” methylation level (C) cannot have increased methylation. Note also that the rate of the methylation reaction is higher if the methylation level is low (A) and significantly higher if the receptor is already bound. #CheR methylating the receptor complex #Rate of methylation is dependent on methylation states and ligand binding TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1 TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30 TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3 Finally, we need reactions for CheB. First, we consider its phosphorylation by the receptor and its autodephosphorylation. Each of these two reactions occurs at a rate that is independent of any other state of the receptor or CheB. #CheB is phosphorylated by receptor complex, and autodephosphorylates CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos CheB also demethylates the receptor complex, at a rate that depends on the current methylation state. (We do not include state A since it cannot be further demethylated.) #CheB demethylates receptor complex #Rate dependent on methylation states TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demeth We are now ready to combine the above reaction rules with the reaction rules inherited from the original model (ligand-receptor binding and CheY phosphorylation/dephosphorylation) to give us a complete set of reaction rules. As pointed out in the main text, were we to write out all possible reactions that are implied from these rules, we would have an enormous model. BioNetGen takes the following rules and converts them into all necessary reactions for us behind the scenes. begin reaction rules #Ligand-receptor binding LigandReceptor: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis #CheY phosphorylation by T and dephosphorylation by CheZ YPhos: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos YDephos: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephos #Receptor complex (specifically CheA) autophosphorylation #Rate dependent on methylation and binding states #Also on free vs. 
bound with ligand TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1 TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8 TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0 TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8 TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6 #CheR binds to and methylates receptor complex #Rate dependent on methylation states and ligand binding TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1 TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30 TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3 #CheB is phosphorylated by receptor complex, and autodephosphorylates CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos #CheB demethylates receptor complex #Rate dependent on methylation states TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demeth end reaction rules Adding compartments. In biological systems, the plasma membrane separates molecules inside of the cell from the external environment. In our chemotaxis system, ligands are outside of the cell, receptors and flagellar proteins are on the membrane, and CheY, CheR, CheB, and CheZ are inside the cell. BioNetGen allows us to compartmentalize our model based on the location of different molecules. Although our model does not call for compartmentalization, it has value in models where we need different concentrations based on different cellular compartments, influencing the rates of reactions involving molecules within these compartments. For this reason, we will take the opportunity to add compartmentalization into our model. Below, we define three compartments corresponding to extra-cellular space (outside the cell), the plasma membrane, and the cytoplasm (inside the cell). Each row indicates four parameters: the name of the compartment; the dimension (2-D or 3-D); the surface area (2-D) or volume (3-D) of the compartment; and the name of the parent compartment, i.e., the compartment that encloses this current compartment. If you are interested, more information on compartmentalization can be found on pages 54-55 of Sekar and Faeder’s primer on rule-based modeling: http://www.lehman.edu/academics/cmacs/documents/RuleBasedPrimer-2011.pdf. begin compartments EC  3  100       #um^3 PM  2  1   EC    #um^2 CP  3  1   PM    #um^3 end compartments Specifying concentrations and reaction rates. To add compartmentalization information in the species section of our BioNetGen model, we use the notation @location before the specification of the concentrations. In what follows, we specify initial concentrations of ligand, receptor, and chemotaxis enzymes at different states. 
The distribution of molecule concentrations at each state is very difficult to verify experimentally; the distribution provided here approximates equilibrium concentrations in our simulation, and they are within a biologically reasonable range.[2] begin species @EC:L(t) L0 @PM:T(l,r,Meth~A,Phos~U) T0*0.84*0.9 @PM:T(l,r,Meth~B,Phos~U) T0*0.15*0.9 @PM:T(l,r,Meth~C,Phos~U) T0*0.01*0.9 @PM:T(l,r,Meth~A,Phos~P) T0*0.84*0.1 @PM:T(l,r,Meth~B,Phos~P) T0*0.15*0.1 @PM:T(l,r,Meth~C,Phos~P) T0*0.01*0.1 @CP:CheY(Phos~U) CheY0*0.71 @CP:CheY(Phos~P) CheY0*0.29 @CP:CheZ() CheZ0 @CP:CheB(Phos~U) CheB0*0.62 @CP:CheB(Phos~P) CheB0*0.38 @CP:CheR(t) CheR0 end species Finally, we need to assign values to the parameters. We will assume that we start with a zero ligand concentration. We then assign the initial concentration of each molecule and the rates of our reactions based on in vivo stoichiometry and parameter tuning.[3][4] Note: Although we discussed reaction rules first, the parameters section below has to appear before the reaction rules section. begin parameters NaV 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3 miu 1e-6 L0 0             #number of molecules/cell T0 7000          #number of molecules/cell CheY0 20000      #number of molecules/cell CheZ0 6000       #number of molecules/cell CheR0 120        #number of molecules/cell CheB0 250        #number of molecules/cell k_lr_bind 8.8e6/NaV    #ligand-receptor binding k_lr_dis 35            #ligand-receptor dissociation k_TaUnbound_phos 7.5   #receptor complex autophosphorylation k_Y_phos 3.8e6/NaV     #receptor complex phosphorylates Y k_Y_dephos 8.6e5/NaV   #Z dephosphorylates Y k_TR_bind  2e7/NaV     #Receptor-CheR binding k_TR_dis   1           #Receptor-CheR dissociation k_TaR_meth 0.08        #CheR methylates receptor complex k_B_phos 1e5/NaV       #CheB phosphorylation by receptor complex k_B_dephos 0.17        #CheB autodephosphorylation k_Tb_demeth 5e4/NaV    #CheB demethylates receptor complex k_Tc_demeth 2e4/NaV    #CheB demethylates receptor complex end parameters Completing our adaptation simulation. We will be ready to simulate once we place the following code after end model. 
We will run our simulation for 800 seconds. generate_network({overwrite=&gt;1}) simulate({method=&gt;"ssa", t_end=&gt;800, n_steps=&gt;800}) The following code contains our complete simulation. begin model begin molecule types L(t) T(l,r,Meth~A~B~C,Phos~U~P) CheY(Phos~U~P) CheZ() CheB(Phos~U~P) CheR(t) end molecule types begin observables Molecules bound_ligand L(t!1).T(l!1) Molecules phosphorylated_CheY CheY(Phos~P) Molecules low_methyl_receptor T(Meth~A) Molecules medium_methyl_receptor T(Meth~B) Molecules high_methyl_receptor T(Meth~C) Molecules phosphorylated_CheB CheB(Phos~P) Molecules CheRbound T(r!2).CheR(t!2) end observables begin parameters NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3 miu 1e-6 L0 0 T0 7000 CheY0 20000 CheZ0 6000 CheR0 120 CheB0 250 k_lr_bind 8.8e6/NaV2   #ligand-receptor binding k_lr_dis 35            #ligand-receptor dissociation k_TaUnbound_phos 7.5   #receptor complex autophosphorylation k_Y_phos 3.8e6/NaV2    #receptor complex phosphorylates Y k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Y k_TR_bind 2e7/NaV2     #Receptor-CheR binding k_TR_dis  1            #Receptor-CheR dissociation k_TaR_meth 0.08        #CheR methylates receptor complex k_B_phos 1e5/NaV2      #CheB phosphorylation by receptor complex k_B_dephos 0.17        #CheB autodephosphorylation k_Tb_demeth 5e4/NaV2   #CheB demethylates receptor complex k_Tc_demeth 2e4/NaV2   #CheB demethylates receptor complex end parameters begin reaction rules LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis #Receptor complex (specifically CheA) autophosphorylation #Rate dependent on methylation and binding states #Also on free vs. bound with ligand TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1 TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8 TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0 TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8 TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6 #CheY phosphorylation by T and dephosphorylation by CheZ YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos YDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephos #CheR binds to and methylates receptor complex #Rate dependent on methylation states and ligand binding TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1 TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30 TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3 #CheB is phosphorylated by receptor complex, and autodephosphorylates CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos #CheB demethylates receptor complex #Rate dependent on methylation states TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demeth end reaction rules begin compartments EC  3  100       #um^3 PM  2  1   EC    #um^2 CP  3  1   PM    #um^3 end compartments begin species @EC:L(t) L0 @PM:T(l,r,Meth~A,Phos~U) T0*0.84*0.9 
@PM:T(l,r,Meth~B,Phos~U) T0*0.15*0.9 @PM:T(l,r,Meth~C,Phos~U) T0*0.01*0.9 @PM:T(l,r,Meth~A,Phos~P) T0*0.84*0.1 @PM:T(l,r,Meth~B,Phos~P) T0*0.15*0.1 @PM:T(l,r,Meth~C,Phos~P) T0*0.01*0.1 @CP:CheY(Phos~U) CheY0*0.71 @CP:CheY(Phos~P) CheY0*0.29 @CP:CheZ() CheZ0 @CP:CheB(Phos~U) CheB0*0.62 @CP:CheB(Phos~P) CheB0*0.38 @CP:CheR(t) CheR0 end species end model generate_network({overwrite=&gt;1}) simulate({method=&gt;"ssa", t_end=&gt;800, n_steps=&gt;800}) Running our adaptation model. Now save your file and run the simulation by clicking on the Run BNG button. The results will be saved in a new folder called adaptation/TIMESTAMP contained in the current directory. Rename the newly created folder from the time stamp to L0_0. Next, open the newly created adaptation.gdat file in your results folder and create a plot by clicking the Built-in plotting button. Because the model is at equilibrium, we will see the seemingly boring plot shown below. Things get interesting when we change the initial concentration of ligand to see how the simulated bacterium will adapt. Run your simulation with L0 = 1e6. What happens to CheY activity? What happens to the concentration of receptors at different methylation states? Try a variety of different initial concentrations of ligand (L0 = 1e4, 1e5, 1e6, 1e7, 1e8), paying attention to the concentration of phosphorylated CheY. How does the concentration change depending on initial ligand concentration? Then try to further raise the ligand concentration to 1e9 and 1e10. How does this affect the outcome of the simulation? Why? Next, try only simulating the first 10 seconds to zoom into what happens to the system at the start. How quickly does CheY concentration reach a minimum? How long does the cell take to return to the original concentration of phosphorylated CheY (i.e., the background tumbling frequency)? Back in the main text, we will examine how a sudden change in the concentration of unbound ligand can cause a quick change in the tumbling frequency of the bacterium, followed by a slow return to its original frequency. We will also see how the extent to which this tumbling frequency is disturbed is dependent upon differences in the initial concentration of ligand. Return to main text References: [1] Spiro PA, Parkinson JS, and Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Biochemistry 94:7263-7268. Available online. [2] Bray D, Bourret RB, Simon MI. 1993. Computer simulation of phosphorylation cascade controlling bacterial chemotaxis. Molecular Biology of the Cell. Available online. [3] Li M, Hazelbauer GL. 2004. Cellular stoichiometry of the components of the chemotaxis signaling complex. Journal of Bacteriology. Available online. [4] Stock J, Lukat GS. 1991. Intracellular signal transduction networks. Annual Review of Biophysics and Biophysical Chemistry. Available online. "
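To answer the timing questions above quantitatively, you can mine the .gdat output directly instead of eyeballing the plot. Below is a minimal sketch of our own, assuming the results were renamed to adaptation/L0_0/adaptation.gdat as suggested above and that the observable is named phosphorylated_CheY as in the model; adjust the path and tolerance to taste.

```python
import numpy as np

path = "adaptation/L0_0/adaptation.gdat"  # adjust to your results folder

# The first line of a .gdat file is a '#' header naming each column.
with open(path) as f:
    header = f.readline().lstrip("#").split()
col = header.index("phosphorylated_CheY")

data = np.loadtxt(path)  # '#' comment lines are skipped automatically
time, cheY = data[:, 0], data[:, col]

# When does CheY-P bottom out, and when does it return to within
# 5% of its initial (adapted) value afterward?
i_min = int(np.argmin(cheY))
print("Minimum CheY-P of", cheY[i_min], "at t =", time[i_min], "s")
recovered = np.where(np.abs(cheY[i_min:] - cheY[0]) < 0.05 * cheY[0])[0]
if recovered.size:
    print("Recovered to within 5% at t =", time[i_min + recovered[0]], "s")
```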
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Building a Diffusion Cellular Automaton",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/tutorial-diffusion",
        "date"     : "",
        "content"  : "In this tutorial, we will use Python to build a Jupyter notebook. We suggest only following the tutorial closely if you are familiar with Python or programming in general. If you have not installed Python, then the following software and packages will need to be installed:            Installation Link      Version1      Check Install                  Python3      3.7      python –version              Jupyter Notebook      4.4.0      jupyter –version              matplotlib      2.2.3      conda list or pip list              numpy      1.15.1      conda list or pip list              scipy      1.1.0      conda list or pip list              imageio      2.4.1      conda list or pip list      You can read more about various installation options here or here.Once you have Jupyter Notebook installed, create a new notebook file called diffusion_automaton.ipynb.Note: You will need to save this file on the same level as another folder named /dif_images. ImageIO will not always create this folder automatically, so you may need to create it manually.You may also download the completed tutorial here.We are now ready to simulate our automaton representing the diffusion of two particle species: a prey (A) and a predator (B). Enter the following into our notebook.import matplotlib.pyplot as pltimport numpy as npimport timefrom scipy import signalimport imageio%matplotlib inlineimages = []To simulate the diffusion process, we will rely upon an imported convolution function. The convolve function will use a specified 3 x 3 laplacian matrix to simulate diffusion as discussed in the main text. Specifically, the convolve function in this case takes two matrices, mtx and lapl, and uses lapl as a set of multipliers for each square in mtx. We can see this operation in action in the image below.A single step in the convolution function which takes the first matrix and adds up each cell multiplied by the number in the second matrix. Here we see (0 * 0) + (2 * ¼) + (0 * 0) + (3 * ¼) + (1 * -1) + (2 * ¼) + (1 * 0) + (1 * ¼) +(1 * 0) = 1Because we’re trying to describe the rate of diffusion over this system, the values in the 3 x 3 laplacian excluding the center sum to 1. In our code, the value in the center is -1 because we’ve specified the change in the system with the convolution function i.e. the matrix dA, which we then add to the original matrix A. Thus the total sum of the laplacian is 0 which means the total change in number of molecules due to diffusion is 0, even if the molecules are moving to new locations. We don’t want any new molecules created due to just diffusion! (This would violate the law of conservation of mass.)We are now ready to write a Python function Diffuse that we will add to our notebook. 
This function will take a collection of parameters: numIter: the number of steps to run our simulation; A, B: matrices containing the respective concentrations of prey and predators in each cell; dt: the unit of time; dA, dB: diffusion rates for prey and predators, respectively; lapl: our 3 x 3 Laplacian matrix; plot_iter: the number of steps to “skip” when animating our simulation. def Diffuse(numIter, A, B, dt, dA, dB, lapl, plot_iter):    print("Running Simulation")    start = time.time()    # Run the simulation    for iter in range(numIter):        A_new = A + (dA * signal.convolve2d(A, lapl, mode='same', boundary='fill', fillvalue=0)) * dt        B_new = B + (dB * signal.convolve2d(B, lapl, mode='same', boundary='fill', fillvalue=0)) * dt        A = np.copy(A_new)        B = np.copy(B_new)        if iter % plot_iter == 0:            plt.clf()            plt.imshow((B / (A+B)),cmap='Spectral')            plt.axis('off')            now = time.time()            # print("Seconds since epoch =", now-start)            # plt.show()            filename = 'dif_images/diffusion_'+str(iter)+'.png'            plt.savefig(filename)            images.append(imageio.imread(filename))    return A, B The following parameters will set up our problem space by defining the grid size, the number of iterations we will range through, and establishing the initial matrices A and B. # _*_*_*_*_*_*_*_*_* GRID PROPERTIES *_*_*_*_*_*_*_*_*_* grid_size = 101 # Needs to be odd numIter = 10000 seed_size = 11 # Needs to be an odd number A = np.ones((grid_size,grid_size)) B = np.zeros((grid_size,grid_size)) # Seed the predators B[int(grid_size/2)-int(seed_size/2):int(grid_size/2)+int(seed_size/2)+1, \int(grid_size/2)-int(seed_size/2):int(grid_size/2)+int(seed_size/2)+1] = \np.ones((seed_size,seed_size)) The following parameters will establish the time step, the diffusion rates, and how many steps will be between frames of our animation. # _*_*_*_*_*_*_*_*_* SIMULATION VARIABLES *_*_*_*_*_*_*_*_*_* dt = 1.0 dA = 0.5 dB = 0.25 lapl = np.array([[0.05, 0.2, 0.05],[0.2, -1.0, 0.2],[0.05, 0.2, 0.05]]) plot_iter = 50 Diffuse(numIter, A, B, dt, dA, dB, lapl, plot_iter) imageio.mimsave('dif_images/diffusion_movie.gif', images) We now are ready to save and run our notebook. When you run the notebook, you should see an animation in which concentrations of predators are spreading out against a field of prey. As we return to the main text, we will discuss this animation and extend our model to be able to handle reactions as well as diffusion. Return to main text [1] Other versions may be compatible with this code, but those listed are known to work for this tutorial. "
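As a sanity check on the convolution arithmetic illustrated in the figure above, the following short snippet (our own addition) reproduces the worked example with scipy.signal.convolve2d and confirms that the center cell comes out to 1:

```python
import numpy as np
from scipy import signal

# The example grid and Laplacian from the figure above.
mtx = np.array([[0, 2, 0],
                [3, 1, 2],
                [1, 1, 1]], dtype=float)
lapl = np.array([[0.00, 0.25, 0.00],
                 [0.25, -1.0, 0.25],
                 [0.00, 0.25, 0.00]])

# The kernel is symmetric, so convolution matches the figure's sum:
# 2*0.25 + 3*0.25 + 2*0.25 + 1*0.25 + 1*(-1) = 1.
result = signal.convolve2d(mtx, lapl, mode='same', boundary='fill', fillvalue=0)
print(result[1, 1])  # 1.0
```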
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Implementing the Feed-Forward Loop Motif",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_feed",
        "date"     : "",
        "content"  : "Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender. This tutorial works with MCell3, CellBlender 3.5.1, and Blender 2.79. Please see a previous tutorial for a link to download these versions.In this tutorial, we will use CellBlender to run a (mathematically controlled) comparison of simple regulation against regulation via the type-1 incoherent feed-forward loop that we saw in the main text.Load your CellBlender_Tutorial_Template.blend file from the Random Walk Tutorial. Save your file as ffl.blend. You may also download the completed tutorial file here.Go to CellBlender &gt; Molecules and create the following molecules:  Click the + button.  Select a color (such as white).  Name the molecule X1.  Select the molecule type as Surface Molecule.  Add a diffusion constant of 1e-6.  Up the scale factor to 5 (click and type “5” or use the arrows).Repeat the above steps to make sure that the following molecules are entered with the appropriate parameters.            Molecule Name      Molecule Type      Diffusion Constant      Scale Factor                  X1      Surface      1e-6      5              Z1      Surface      1e-6      1              X2      Surface      1e-6      5              Y2      Surface      1e-6      1              Z2      Surface      1e-6      1      Now go to CellBlender &gt; Molecule Placement to set the following release sites for our molecules:  Click the + button.  Select or type in the molecule X1.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 300.Repeat the above steps to make sure all of the following molecule release sites are entered.            Molecule Name      Object/Region      Quantity to Release                  X1      Plane      300              X2      Plane      300      Next go to CellBlender &gt; Reactions to create the following reactions:  Click the + button.  Under reactants, type X1’ (note the apostrophe).  Under products, type X1’ + Z1’.  Set the forward rate as 4e2.Repeat the above steps for the following reactions.            Reactants      Products      Forward Rate                  X1’      X1’ + Z1’      4e2              Z1’      NULL      4e2              X2’      X2’ + Y2’      2e2              X2’      X2’ + Z2’      4e3              Y2’ + Z2’      Y2’      4e2              Y2’      NULL      4e2              Z2’      NULL      4e2      Go to CellBlender &gt; Plot Output Settings to set up a plot as follows:  Click the + button.  Set the molecule name as Z1.  Ensure World is selected.  Ensure Java Plotter is selected.  Ensure One Page, Multiple Plots is selected.  Ensure Molecule Colors is selected.Repeat the above steps to ensure that we plot all of the following molecules.            Molecule Name      Selected Region                  Z1      World              Z2      World      We are now ready to run our simulation. Go to CellBlender &gt; Run Simulation and select the following options:  Set the number of iterations to 12000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Once the simulation has run, we can visualize our data with CellBlender &gt; Reload Visualization Data.If you like, you can watch the animation within the Blender window by clicking the play button at the bottom of the screen.Now go back to CellBlender &gt; Plot Output Settings and scroll to the bottom to click “Plot”. 
This will produce a plot of the amount of Z under simple regulation compared to the amount of Z for the feed-forward loop. Is it what you expected?Save your file, and then use the link below to return to the main text, where we will interpret the outcome of our simulation.Return to main text"
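For intuition about what the plot should show, here is a deterministic ODE sketch of the same comparison. This is our own addition, not part of the CellBlender workflow: the rate constants are arbitrary stand-ins chosen to mirror the spirit of the reaction table above, not the stochastic model's literal parameters. The qualitative signature to look for is the pulse of Z2 under the incoherent feed-forward loop versus the monotone rise of Z1 under simple regulation.

```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

# Simple regulation: X -> X + Z1, Z1 -> null.
# Incoherent FFL:    X -> X + Y, X -> X + Z2, Y + Z2 -> Y, Y/Z2 -> null.
k_z1, k_y, k_z2, k_rep, k_deg = 1.0, 0.5, 10.0, 2.0, 1.0

def rhs(t, s):
    z1, y, z2 = s
    dz1 = k_z1 - k_deg * z1                    # simple regulation
    dy = k_y - k_deg * y                       # intermediate repressor Y
    dz2 = k_z2 - k_rep * y * z2 - k_deg * z2   # Z2 repressed by Y
    return [dz1, dy, dz2]

sol = solve_ivp(rhs, (0, 10), [0.0, 0.0, 0.0], dense_output=True)
t = np.linspace(0, 10, 200)
z1, y, z2 = sol.sol(t)

plt.plot(t, z1, label='Z1 (simple regulation)')
plt.plot(t, z2, label='Z2 (incoherent FFL)')
plt.xlabel('time (arbitrary units)'); plt.ylabel('concentration')
plt.legend(); plt.show()
```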
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Traveling Up an Attractant Gradient",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_gradient",
        "date"     : "",
        "content"  : "In the previous tutorial, we modeled how bacteria react and adapt to a one-time addition of attractants. In real life, bacteria don’t suddenly drop into an environment with more attractants; instead, they explore a variable environment. In this tutorial, we will adapt our model to simulate a bacterium as it travels up an exponentially increasing concentration gradient.We will also explore defining and using functions, a feature of BioNetGen that will allow us to specify reaction rules in which the reaction rates are dependent on the current state of the system.To get started, open Visual Studio Code, and click File &gt; Open Folder.... Open the EColiSimulations folder from the first tutorial. Create a copy of your adaptation.bngl file from the adaptation tutorial and save it as addition.bngl. If you would rather not follow along below, you can download a completed BioNetGen file here: addition.bnglWe also will build a Jupyter notebook in this tutorial for plotting the concentrations of molecules over time. You should create a file called plotter_up.ipynb; if you would rather not follow along, we provide a completed notebook here:plotter_up.ipynbBefore running this notebook, make sure the following dependencies are installed.            Installation Link      Version      Check install/version                  Python3      3.6+      python --version              Jupyter Notebook      4.4.0+      jupyter --version              Numpy      1.14.5+      pip list \| grep numpy              Matplotlib      3.0+      pip list \| grep matplotlib              Colorspace or with pip      any      pip list \| grep colorspace      Modeling an increasing ligand gradient with a BioNetGen functionOur BioNetGen model will largely stay the same, except for the fact that we are changing the concentration of ligand over time. To model an increasing concentration of ligand corresponding to a bacterium moving up an attractant gradient, we will increase the background ligand concentration at an exponential rate.We will simulate an increase in attractant concentration by using a “dummy reaction” L  2L in which one ligand molecule becomes two. To do so, we will add the following reaction to the reaction rules section.As we have observed earlier in this module, when the ligand concentration is very high, receptors are saturated, and the cell can no longer detect a change in ligand concentration. If you explored the adaptation simulation, then you saw that this occurs after l0 passes 1e8; we will therefore cap the allowable ligand concentration at this value.We can cap our ligand concentration by defining the rate of the dummy reaction using a function add_Rate(). This function requires another observable, AllLigand. By adding the line Molecules AllLigand L in the observables section, AllLigand will record the total concentration of ligand in the system at each time step (both bound and unbound). As for our reaction, if AllLigand is less than 1e8, then the dummy reaction should take place at some given rate k_add. Otherwise, AllLigand exceeds1e8, and we will set the rate of the dummy reaction to zero. 
This can be achieved with a functions section in BioNetGen using the following if statement to branch based on the value of AllLigand. Note: Please ensure that the functions section occurs before the reaction rules section in your BioNetGen file. begin functions addRate() = if(AllLigand&gt;1e8,0,k_add) end functions Now we are ready to add our dummy reaction to the reaction rules section with a reaction rate of addRate(). #Simulate an exponentially increasing gradient using a dummy reaction LAdd: L(t) -&gt; L(t) + L(t) addRate() Now that we have defined our dummy reaction, we should specify the default rate of this reaction k_add in the parameters section. We first will try a value of k_add of 0.1/s with an initial ligand concentration L0 of 1e4. This means that the model is simulating a gradient of d[L]/dt = 0.1[L]. If L0 is 1e4, then the solution to this differential equation is [L] = 1e4 · e^(0.1t) molecules. k_add 0.1 L0 1e4 Running our updated BioNetGen model. Because we have largely kept the same model from the adaptation tutorial, we are ready to simulate. Please make sure that the following lines appear after end model so that we can run our simulation for 1000 seconds. generate_network({overwrite=&gt;1}) simulate({method=&gt;"ssa", t_end=&gt;1000, n_steps=&gt;500}) The following code contains our complete simulation, which you can also download here: addition.bngl begin model begin molecule types L(t) T(l,r,Meth~A~B~C,Phos~U~P) CheY(Phos~U~P) CheZ() CheB(Phos~U~P) CheR(t) end molecule types begin observables Molecules bound_ligand L(t!1).T(l!1) Molecules phosphorylated_CheY CheY(Phos~P) Molecules low_methyl_receptor T(Meth~A) Molecules medium_methyl_receptor T(Meth~B) Molecules high_methyl_receptor T(Meth~C) Molecules phosphorylated_CheB CheB(Phos~P) Molecules AllLigand L end observables begin parameters NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3 miu 1e-6 L0 1e4 T0 7000 CheY0 20000 CheZ0 6000 CheR0 120 CheB0 250 k_lr_bind 8.8e6/NaV2   #ligand-receptor binding k_lr_dis 35            #ligand-receptor dissociation k_TaUnbound_phos 7.5   #receptor complex autophosphorylation k_Y_phos 3.8e6/NaV2    #receptor complex phosphorylates Y k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Y k_TR_bind 2e7/NaV2     #Receptor-CheR binding k_TR_dis  1            #Receptor-CheR dissociation k_TaR_meth 0.08        #CheR methylates receptor complex k_B_phos 1e5/NaV2      #CheB phosphorylation by receptor complex k_B_dephos 0.17        #CheB autodephosphorylation k_Tb_demeth 5e4/NaV2   #CheB demethylates receptor complex k_Tc_demeth 2e4/NaV2   #CheB demethylates receptor complex k_add 0.1              #Ligand increase end parameters begin functions addRate() = if(AllLigand&gt;1e8,0,k_add) end functions begin reaction rules LigandReceptor: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis #Receptor complex (specifically CheA) autophosphorylation #Rate dependent on methylation and binding states #Also on free vs. 
bound with ligand	TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos	TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1	TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8	TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0	TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8	TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6	#CheY phosphorylation by T and dephosphorylation by CheZ	YPhos: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDephos: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephos	#CheR binds to and methylates receptor complex	#Rate dependent on methylation states and ligand binding	TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis	TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth	TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1	TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30	TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3	#CheB is phosphorylated by receptor complex, and autodephosphorylates	CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos	CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos	#CheB demethylates receptor complex	#Rate dependent on methylation states	TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth	TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demeth	#Simulate exponentially increasing gradient	LAdd: L(t) -&gt; L(t) + L(t) addRate()end reaction rulesbegin compartments  EC  3  100       #um^3  PM  2  1   EC    #um^2  CP  3  1   PM    #um^3end compartmentsbegin species	@EC:L(t) L0	@PM:T(l,r,Meth~A,Phos~U) T0*0.84*0.9	@PM:T(l,r,Meth~B,Phos~U) T0*0.15*0.9	@PM:T(l,r,Meth~C,Phos~U) T0*0.01*0.9	@PM:T(l,r,Meth~A,Phos~P) T0*0.84*0.1	@PM:T(l,r,Meth~B,Phos~P) T0*0.15*0.1	@PM:T(l,r,Meth~C,Phos~P) T0*0.01*0.1	@CP:CheY(Phos~U) CheY0*0.71	@CP:CheY(Phos~P) CheY0*0.29	@CP:CheZ() CheZ0	@CP:CheB(Phos~U) CheB0*0.62	@CP:CheB(Phos~P) CheB0*0.38	@CP:CheR(t) CheR0end speciesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;1000, n_steps=&gt;500})Now save your file and run the simulation by clicking on the Run BNG button. The results will be saved in a new folder called addition/TIME contained in the current directory. Rename the folder from the timestamp to the value of k_add, 0.1.Open the newly created addition.gdat file and create a plot by clicking the Built-in plotting button.What happens to the concentration of phosphorylated CheY?Note: You can deselect AllLigand to make the plot of the concentration of phosphorylated CheY easier to see.Next, try the following few different values for k_add: 0.01, 0.03, 0.05, 0.1, 0.3, 0.5. What do these changing k_add values represent in the simulation? How does the system respond to the different values?Note: All of your simulation results are stored in the addition/TIME/ directory within your working directory. As you change the value of k_add, rename the directory with the k_add values instead of the timestamp for simplicity.You will observe that CheY phosphorylation drops gradually at first, instead of the instantaneous sharp drop that we saw when adding a large amount of ligand at once. 
This means that as the ligand concentration increases, the cell is able to continuously lower its tumbling frequency.Visualizing the results of our simulationWe are now ready to fill in plotter_up.ipynb, a Jupyter notebook that we will use to visualize the outcome of our simulations.First, specify the directories, model name, species of interest, and rates. Put the plotter_up.ipynb file inside the same folder as addition.bngl, or change the model_path below to point at this folder.#Specify the data to plot here.model_path = "addition"  #The folder containing the modelmodel_name = "addition"  #Name of the modeltarget = "phosphorylated_CheY"    #Target moleculevals = [0.01, 0.03, 0.05, 0.1, 0.3, 0.5]  #Gradients of interestWe next provide some import statements for needed dependencies.import numpy as npimport sysimport osimport matplotlib.pyplot as pltimport colorspaceTo compare the responses for different gradients, we color-code each gradient. The colorspace package offers a straightforward way to set up a color palette. Here we use a qualitative palette with hues (h) equally spaced between [0, 300], and constant chroma (c) and luminance (l) values.#Define the colors to usecolors = colorspace.qualitative_hcl(h=[0, 300.], c = 60, l = 70, palette = "dynamic")(len(vals))The following function loads and parses the data. Once the file containing your data is loaded, we use the first row to determine which column stores the concentration of the “target” observable species of interest. When we find that target, we will then access the time points and concentrations of this target molecule.def load_data(val):    file_path = os.path.join(model_path, str(val), model_name + ".gdat")    with open(file_path) as f:        first_line = f.readline() #Read the first line    cols = first_line.split()[1:] #Get the col names (species names)    ind = 0    while cols[ind] != target:        ind += 1                  #Get col number of target molecule    data = np.loadtxt(file_path)  #Load the file    time = data[:, 0]             #Time points    concentration = data[:, ind]  #Concentrations    return time, concentrationNow we will write a function to plot the time coordinates on the x-axis and the concentrations of the molecule at these time points on the y-axis. To do so, we will use the Matplotlib plot function to plot concentrations through time for each gradient value. The time-series data will be colored using the palette we defined earlier.def plot(val, time, concentration, ax, i):    legend = "k = " + str(val)    ax.plot(time, concentration, label = legend, color = colors[i])    ax.legend()    returnThe plotting function above needs to be initialized with a figure defined by the subplots function. We loop through every gradient value to perform the plotting. Afterward, we define labels for the x-axis and y-axis, the figure title, and tick lines. The call to plt.show() displays the plot.fig, ax = plt.subplots(1, 1, figsize = (10, 8))for i in range(len(vals)):    val = vals[i]    time, concentration = load_data(val)    plot(val, time, concentration, ax, i)plt.xlabel("time (s)")plt.ylabel("concentration (#molecules)")plt.title("Phosphorylated CheY vs time")ax.minorticks_on()ax.grid(b = True, which = 'minor', axis = 'both', color = 'lightgrey', linewidth = 0.5, linestyle = ':')ax.grid(b = True, which = 'major', axis = 'both', color = 'grey', linewidth = 0.8 , linestyle = ':')plt.show()Now run the notebook. How do changing values of k_add impact the CheY-P concentrations? 
Why do you think this is?In the main text, we will examine the results of our plots and discuss how they can be used to infer the cell’s behavior in the presence of increasing attractant.Return to main text"
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Implementing the Gray-Scott Reaction-Diffusion Automaton",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/gs-jupyter",
        "date"     : "",
        "content"  : "The following tutorial will use a Jupyter Notebook to implement the Gray-Scott model. It requires a familiarity with Python, and installation instructions can be found in our coarse-grained diffusion tutorial. You may also download the completed tutorial file here.Assuming you have Jupyter notebook, create a new file called gray-scott.ipynb (you may instead want to duplicate and modify your diffusion_automaton.ipynb file from the diffusion tutorial).Note: You should make sure to save this notebook on the same level as another folder named /dif_images. ImageIO will not always create this folder automatically, so you may need to create it manually.At the top of the notebook, we need the following imports and declarations along with a specification of the simulate function that will drive our Gray-Scott simulation.import matplotlib.pyplot as pltimport numpy as npimport timefrom scipy import signalimport imageio%matplotlib inline'''Simulate functionDescription: Simulate the Gray-Scott model for numIter iterations.Inputs:    - numIter:  number of iterations    - A:        prey matrix    - B:        predator matrix    - f:        feed rate    - k:        kill rate    - dt:       time constant    - dA:       prey diffusion constant    - dB:       predator diffusion constant    - lapl:     3 x 3 Laplacian matrix to calculate diffusionOutputs:    - A_matrices:   Prey matrices over the course of the simulation    - B_matrices:   Predator matrices over the course of the simulation'''The Simulate function will take in the same parameters as the Diffuse function from the diffusion tutorial, but it will also take parameters f and k corresponding to the Gray-Scott feed and kill parameters, respectively. The simulation is in fact very similar to the diffusion notebook except for a very slight change that we make by adding the feed, kill, and predator-prey reactions when we update the matrices A and B containing the concentrations of the two particles over all the cells in the grid.images = []def Simulate(numIter, A, B, f, k, dt, dA, dB, lapl, plot_iter):    print("Running Simulation")    start = time.time()    # Run the simulation    for iter in range(numIter):        A_new = A + (dA * signal.convolve2d(A, lapl, mode='same', boundary='fill', fillvalue=0) - (A * B * B) + (f * (1-A))) * dt        B_new = B + (dB * signal.convolve2d(B, lapl, mode='same', boundary='fill', fillvalue=0) + (A * B * B) - (k * B)) * dt        A = np.copy(A_new)        B = np.copy(B_new)        if (iter % plot_iter is 0):            plt.clf()            plt.imshow((B / (A+B)),cmap='Spectral')            plt.axis('off')            now = time.time()            # print("Seconds since epoch =", now-start)            # plt.show()            filename = 'gs_images/gs_'+str(iter)+'.png'            plt.savefig(filename)            images.append(imageio.imread(filename))    return A, BThe following parameters will establish the grid size, the number of iterations we will range through, and where the predators and prey will start.# _*_*_*_*_*_*_*_*_* GRID PROPERTIES *_*_*_*_*_*_*_*_*_*grid_size = 101 # Needs to be oddnumIter = 5000;seed_size = 11 # Needs to be an odd numberA = np.ones((grid_size,grid_size))B = np.zeros((grid_size,grid_size))# Seed the predatorsB[int(grid_size/2)-int(seed_size/2):int(grid_size/2)+int(seed_size/2)+1, \int(grid_size/2)-int(seed_size/2):int(grid_size/2)+int(seed_size/2)+1] = \np.ones((seed_size,seed_size))The remaining parameters establish feed rate, kill rate, time interval, diffusion 
rates, the Laplacian we will use, and how often to draw a board to an image when rendering the animation.# _*_*_*_*_*_*_*_*_* SIMULATION VARIABLES *_*_*_*_*_*_*_*_*_*f = 0.055k = 0.117dt = 1.0dA = 1.0dB = 0.5lapl = np.array([[0.05, 0.2, 0.05],[0.2, -1.0, 0.2],[0.05, 0.2, 0.05]])plot_iter = 50After adding the code below to the bottom of the notebook, we are now ready to save our file and run the program to generate the animations.Simulate(numIter, A, B, f, k, dt, dA, dB, lapl, plot_iter)imageio.mimsave('gs_images/gs_movie.gif', images)When you run your simulation, you should see an image analogous to the one in the diffusion simulation, but with much more complex behavior since we have added reactions to our model.  Try changing the feed and kill rates very slightly (e.g., by 0.01). How does this affect the end result of your simulation? What if you keep making changes to these parameters? You should get images similar to the ones below.In the main text, we will discuss how, much as we saw with the particle-based reaction-diffusion model, slight changes to the critical parameters in our model can produce vast differences in the beautiful patterns that emerge.Return to main text"
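For reference, the update step implemented in Simulate is a discretization of the Gray-Scott reaction-diffusion system. Writing ∇² for the Laplacian computed by the convolution with lapl, each iteration performs:

```latex
A_{t+\Delta t} = A_t + \left( d_A \nabla^2 A_t - A_t B_t^2 + f\,(1 - A_t) \right) \Delta t
B_{t+\Delta t} = B_t + \left( d_B \nabla^2 B_t + A_t B_t^2 - k\,B_t \right) \Delta t
```

The A·B² term is the predator-prey reaction, f feeds prey everywhere, and k removes predators.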
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Using Homology Modeling to Predict the Structure of the SARS-CoV-2 Spike Protein",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_homology",
        "date"     : "",
        "content"  : "In this software tutorial, we will apply three popular software resources (SWISS-MODEL, Robetta, and GalaxyWEB) that use homology modeling to predict the structure of the SARS-CoV-2 spike protein. Recall from the main text that this protein is a homotrimer, meaning that it consists of three identical protein structures called chains. In what follows, we will predict the sequence of a single chain.The details of how the three software resources presented in this lesson differ are beyond the scope of our work in this course. If you are interested in understanding how they each implement homology modeling, then we suggest that you consult the documentation of the appropriate resource.SWISS-MODELTo run SWISS-MODEL, first download the sequence of the spike protein chain: SARS-CoV-2 spike protein chain.Next, go to the main SWISS-MODEL website and click Start Modeling.On the next page, copy and paste the sequence into the Target Sequence(s): box. Name your project and enter an email address to get a notification of when your results are ready. Finally, click Build Model to submit the job request. Note that you do not need to specify that you want to use the SARS-CoV spike protein as a template because the software will automatically search for a template for you.Your results may take between an hour and a day to finish depending on how busy the server is. (In the meantime, feel free to run the remaining software.) When you receive an email notification, follow the link provided and you can download the final models.When we ran our own job, SWISS-MODEL did indeed use one of the PDB entries of SARS-CoV spike protein as its template (PDB: 6crx) and correctly recognized that the template was a homotrimer. As a result, the software predicted a complete spike protein with all three chains included. An image of our results can be seen below. You can also download our results. We will discuss how to interpret these results and the .pdb file format when we return to the main text.Structures of the three models of this protein reported by SWISS-MODEL. The superimposed structure of all three models is shown on the bottom right.RobettaRobetta is a publicly available software resource that uses the same software as the distributed Rosetta@home project. As with SWISS-MODEL, we will provide Robetta a single chain of the SARS-CoV-2 spike protein.First, if you have not already done so, download the sequence of the chain: SARS-CoV-2 spike chain sequence.Next, visit Robetta and register for an account.Then, click Structure Prediction &gt; Submit.Create a name for the job, i.e. “SARS-CoV-2 Spike Chain”. Copy and paste the downloaded sequence into the Protein sequence box. Check CM only (for homology modeling), complete the arithmetic problem provided to prove you are human, and then click Submit.You should receive an email notification with a link to results after between an hour and a day. In our own run, unlike SWISS-MODEL, Robetta did not deduce that the input protein was a trimer and only predicted a single chain. The structure of the results from our own run of Robetta are shown in the figure below. You can also download our results if you like.The homology models produced by Robetta of one of the chains of the SARS-CoV-2 spike protein. The superimposition of all structures is shown on the bottom right.GalaxyWEBGalaxyWEB is a server with many available services for protein study, including protein structure prediction. 
GalaxyTBM (the template-based modeling service) uses HHsearch to identify up to 20 templates, and then matches the core sequence with the templates using PROMALS3D. Next, models are generated using MODELLER-CSA.Because GalaxyWEB has a sequence limit of 1000 amino acids, we cannot use the entire spike protein chain. Instead, we will model the receptor binding domain (RBD) of the spike protein, which we introduced in the main text as a variable domain within the spike protein’s S1 subunit.First, download the sequence of the RBD.Then, visit the GalaxyWEB homepage. At the top, click Services &gt; TBM.Enter a job name, e.g., SARS-CoV-2 RBD. Enter an email address and then copy and paste the RBD sequence into the SEQUENCE box. Finally, click Submit.You should receive an email notification within a day with a link to your results. The results of our run of GalaxyWEB along with the validated structure of the SARS-CoV-2 RBD (PDB entry: 6lzg) are visualized in the figure below. You can also download our results if you like.Homology models predicted by GalaxyWEB for the SARS-CoV-2 spike protein RBD. The superimposition of all these structures is shown on the bottom right.Interpreting the results of our software runsIn the figures above, the structures predicted by the three software resources appear to be reasonably accurate. But throughout this course, we have prioritized using quantitative metrics to analyze results. As we return to the main text, our question is how to develop a quantitative metric to compare the results of these models to each other and to the validated structure of the SARS-CoV-2 spike protein.Return to main text"
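One practical note: GalaxyWEB's 1000-amino-acid limit is easy to check before you submit. A minimal Python sketch (the filename is a placeholder for whichever FASTA file you downloaded):

```python
# Count residues in a single-record FASTA file before submitting to GalaxyWEB,
# which caps input at 1000 amino acids.
def fasta_length(path):
    """Return the number of residues in a single-record FASTA file."""
    with open(path) as f:
        return sum(len(line.strip()) for line in f if not line.startswith(">"))

n = fasta_length("rbd_sequence.fasta")  # hypothetical filename
print(n, "residues:", "OK for GalaxyWEB" if n <= 1000 else "too long")
```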
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Training a Classifier on an Image Shape Space",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/tutorial_image_classification",
        "date"     : "",
        "content"  : "Installing WekaIn this tutorial, we will apply the k-NN classifier to the post-PCA shape space of WBC nuclei images that we generated in the previous tutorial. To do so, we will need a statistical software framework that includes classification algorithms. There are a number of popular platforms available, but we will choose Weka, developed at the University of Waikato in New Zealand, since it is relatively light-weight and easy to get running quickly.To install Weka, follow the instructions provided at the Weka wiki.Converting a shape space fileTo convert our current PCA pipeline coordinates to a format to be used in Weka, we need to convert the WBC_PCA.csv file that we produced in the previous tutorial and that contains the coordinates of every image in the post-PCA shape space into the arff format used by Weka. If you have not completed the previous tutorial, or you would like to skip to the next section of this tutorial, we provide the completed file for download here.Open Weka and navigate to Tools --&gt; ArffViewer.Then navigate to File --&gt; Open.Change the Files of Type option to CSV data files.Find (or download) the WBC_PCA.csv file in your Step4_Visualization folder and click Open.Once all the data is loaded on screen, navigate to File --&gt; Save as ….Remove the .csv extension in the File Name field, and click Save.As a result, our PCA pipeline coordinates have now been converted to the file format that Weka accepts for further classification. This file should be saved as WBC_PCA.arff in the Step4_Visualization subfolder of the WBC_CellClass folder.Now that we have the PCA dataset in the correct format, click Exit to return to the Weka home screen.Running our first classifierYou should now be at the Weka GUI Chooser window that shows at the application’s startup. Under Applications, click Explorer to bring up the Weka Explorer window. This is the main window that we will use to run our classifier.Next, we need to load our WBC_PCA.arff file that we just created. At the top left of the window, click Open file... Navigate to the location of your WBC_PCA.arff file (the default location would be Desktop/WBC_PCAPipeline/Step4_Visualization). When we do so, we should see the data loaded into the window, as shown in the figure below.We want to ensure that Weka only considers the variables that are relevant for classifying the images by family. For this analysis, we won’t need the FILENAME name or the TYPE variables (if we were to include them, Weka would try to use them as one of the coordinates of our shape space vectors). So, click the checkboxes next to these two vectors, and click Remove to exclude them from consideration.Let’s classify! Click the Classify tab at the top of the explorer window. Near the top of the window you will see a button that says Choose, with ZeroR next to it. This button will allow us to select our classifier.If you’re curious what ZeroR means, it is the clown classifier from the main text that assigns every object to the class containing the most elements. Let’s not use this classifier! Instead, click Choose, which will bring up a menu of folders as shown below.The k-NN classifer is contained under lazy &gt; IBK. Select IBK, and you will be taken back to the explorer window, except that next to Choose you should now see IBK followed by a collection of parameters. 
The only parameter that we need for k-NN is the value of k (the number of nearest neighbors to consider when assigning a class to an object), which by default is set to 1 as indicated by -K 1.Under Test Options, we see Cross-validation is selected, which is what we want. Let us leave the number of folds equal to 10, the default value.Finally, beneath More options, we will see (Num) Var344. This is the variable that Weka will use to assign classes; instead, we would like Weka to classify objects by family. So, select this field, scroll up to the top, and select (Nom) FAMILY.Note: Here, Num indicates a numeric variable, and Nom indicates a nominal variable (meaning that it corresponds to a name).Now for the magic moment. Click Start. The classifier should run very quickly, and the results will show in the main window to the right and are reproduced below.The results are horrible! Every image in our dataset has been classified as a lymphocyte. What could have gone wrong?Reducing the number of dimensions consideredRemember when we said that weird things happen in multi-dimensional space? The above result is one of those things. For some reason, every object in the dataset is closest to a lymphocyte. We could dig into the gritty details of the data to try to determine why this is the case, but instead, we will mutter something about the curse of dimensionality.When we used CellOrganizer to build a shape space with PCA, it produced a hyperplane with 344 dimensions (one fewer than the total number of images), which is far more than we need. The good news is that one of the features of PCA is that if we would instead like a hyperplane with some smaller number of dimensions d, then we only need to consider the first d coordinates of every point in the space.In our case, we will simply remove most of the variables under consideration by taking d = 20. To do so, click on the Preprocess tab. Under Attributes, select All, and then deselect FAMILY and the variables Var1 through Var20. Click Remove to ignore the other variables.Removing variables is always counterintuitive to a three-dimensional mind, but let us see what happens when we run the classifier again. Click the Classify tab, and you will see that (Num) Var20 is selected as the variable to use for classification. Select (Nom) FAMILY and click Start. In our run, this produces the following confusion matrix in the output window.            Granulocyte      Monocyte      Lymphocyte                  255      3      33              16      0      5              7      0      26      This is much better! The classifier seems to be performing particularly well on granulocytes. So, if removing some variables was a good thing, let’s remove a few more. Head back to Preprocess, remove Var16 through Var20, and run the classifier again. Our run yields the following updated confusion matrix.            Granulocyte      Monocyte      Lymphocyte                  252      8      31              18      2      1              4      1      28      We are getting a little better! If we remove Var11 through Var15, you can verify that we obtain the following confusion matrix.            Granulocyte      Monocyte      Lymphocyte                  259      9      23              14      6      1              5      2      26      In each step, our confusion matrix appears to be a little better, and the metrics that we introduced in the main text improve as well. 
In the Classifier output window, you can see that the accuracy has increased to 84.3%, while the weighted averages of precision and recall over all three classes have increased to 0.857 and 0.843, respectively.All this dimension reduction may make us wonder how far we should take it: should we reduce everything down to a single dimension? Yet if we remove Var6 through Var10, we see that our confusion matrix gets a little worse:            Granulocyte      Monocyte      Lymphocyte                  261      13      17              13      8      0              16      2      17      And if we take the number of dimensions down to three, it gets a little worse still:            Granulocyte      Monocyte      Lymphocyte                  257      15      19              16      5      0              20      0      13      We have therefore replicated an instance of a very deep fact in data science, which is that there is typically a “Goldilocks” value in the number of dimensions we should use for our PCA hyperplane, at which the algorithm performs optimally. In the case of this WBC image dataset, that sweet spot is around 10 dimensions.Note: If anything is still unclear about using Weka and exploring its output, Jen Golbeck made an excellent YouTube video that you may like to check out.STOP: There are two other considerations that we should take into account: the value of k in our k-nearest neighbors approach (which has defaulted to 1) and the number of folds f used (which has defaulted to 10). We encourage you to continue running the k-NN classifier for a few different values of k and f (which can range from 2 to 345, the number of images). What do you find? Does it match your intuition? And what happens if we try a different classifier entirely?Subclassifying images by cell typeWe classified our WBC images by family, but granulocytes further subdivide into three classes (basophils, eosinophils, and neutrophils). This means that we could just as well have classified images into five categories corresponding to cell type.To do so, click the Preprocess tab at the top of the Weka explorer window. Click Open File again, and open your WBC_PCA.arff file. (It has not been modified by Weka.) This time, under Attributes, remove FILENAME, FAMILY, and the variables Var11 through Var344.Then, click Classify, and again run k-NN with k = 1 and the number of folds equal to 10, making sure to select (Nom) TYPE as your variable to classify. As we return to the main text, we ask you to reflect on the results.STOP: What are the accuracy, precision, and recall of this classifier? How does the confusion matrix compare to the one that we produced for three classes? From a biological perspective, why do you think that the algorithm is struggling?Return to main text"
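If you prefer scripting to Weka's GUI, the same experiment is straightforward to reproduce in Python with scikit-learn. The sketch below assumes the WBC_PCA.csv file from the previous tutorial, with FILENAME, TYPE, and FAMILY columns followed by the Var coordinates; exact numbers will differ somewhat from Weka's run.

```python
# Alternative sketch of the same experiment using scikit-learn instead of Weka:
# 1-nearest-neighbor, 10-fold cross-validation, first 10 post-PCA coordinates.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("WBC_PCA.csv")  # the same file we converted to .arff above
X = df.drop(columns=["FILENAME", "TYPE", "FAMILY"]).iloc[:, :10]  # first 10 PCs
y = df["FAMILY"]

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1, as in the Weka run
pred = cross_val_predict(knn, X, y, cv=10)  # 10-fold cross-validation

print(confusion_matrix(y, pred, labels=sorted(y.unique())))
print("accuracy:", round(accuracy_score(y, pred), 3))
```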
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Hunting for Loops in Transcription Factor Networks",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_loops",
        "date"     : "",
        "content"  : "In this tutorial, we will build a Jupyter Notebook to analyze loops in the E. coli transcription factor network, which can be downloaded here. If you would like to jump to the end of the analysis, you can download the complete Jupyter Notebook here.You will also need the following helper file:Python FileBefore running this tutorial, make sure that the following software and packages are installed.Warning: Be careful of the igraph installation and follow the website instructions carefully. When installing via pip or conda, specify “python-igraph” instead of “igraph”.            Installation Link      Version1      Check Install                  Python3      3.7      python –version              Jupyter Notebook      4.4.0      jupyter –version              python-igraph      0.8.0      conda list or pip list      Create a blank Jupiter notebook titled loops.ipynb and start editing this file below. First, we import the transcription factor network and see how many nodes and edges there are, as well as count the number of loops.# NOTE: when installing via pip or conda, install python-igraphfrom igraph import *from network_loader import *import randomtxt_file = 'network_tf_tf_clean.txt'network, vertex_names = open_network(txt_file)# how many nodes &amp; edgesprint("Number of nodes: ", len(network.vs))print("Number of edges: ", len(network.es))print("Number of self-loops: ", sum(Graph.is_loop(network)))If you run your notebook, you should obtain the following statistics.  Number of nodes:  197  Number of edges:  477  Number of self-loops:  130We can also create a visualization of the network by adding the following line of code to our network.plot(network, vertex_label=vertex_names, vertex_label_size=8,     edge_arrow_width=1, edge_arrow_size=0.5, autocurve=True)Running the notebook now produces the following network.Our plan is to compare this network against a random network. The following code will call a function from a package to generate a random network with 197 nodes and 477 edges and plot it. It uses a built in function called random.seed() that takes an integer as input and uses this function to initiate a (pseudo)random number generator that will allow us to generate a random network.  There is nothing special about the input value 42 here  or is there?random.seed(42)g = Graph.Erdos_Renyi(197,m=477,directed=True, loops=True)plot(g, edge_arrow_width=1, edge_arrow_size=0.5, autocurve=True)The resulting network is shown in the figure below.The question is how many edges and self-loops this network has, which is handled by the following code.# how many nodes &amp; edgesprint("Number of nodes: ", len(g.vs))print("Number of edges: ", len(g.es))print("Number of self-loops: ", sum(Graph.is_loop(g)))This code produces the following statistics for the random network.  Number of nodes:  197  Number of edges:  477  Number of self-loops:  5The number of self-loops is significantly lower in the random network compared to the real transcription factor network.STOP: Change the input integer to random.seed to any integer you like. How does it affect the number of nodes, edges, and self-loops? Try changing the input to a few different values.Regardless of what seed value we use, we can confirm that the number of self-loops expected in a random graph is significantly lower than in the real E. coli network. 
Back in the main text, we will discuss this significance and then see if we can determine why autoregulation has arisen.Return to main text            [1] Other versions may be compatible with this code, but those listed are known to work for this tutorial.      "
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Getting Started with BioNetGen and Modeling Ligand-Receptor Dynamics",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_lr",
        "date"     : "",
        "content"  : "This collection of tutorials will gradually build up from scratch a chemotaxis simulation using BioNetGen.In this tutorial, we will:  set up BioNetGen;  explore several key aspects of BioNetGen modeling: rules, species, simulation method, and parameters  use BioNetGen to model ligand-receptor dynamics and compute a steady state concentration of ligands and receptors.What is BioNetGen?In past modules, we have worked with chemical reactions that can be thought of as rules (e.g., “whenever an X particle and a Y particle collide, replace them with a single X particle”). The chemotaxis pathway also can be thought of as a set of biochemical rules specifying a set of mathematical equations dictating molecule concentrations. Our larger goal is to use BioNetGen to translate these rules into a reasonable chemotaxis simulation, then visualize and interpret the results.In this tutorial, we will focus only on modeling ligand-receptor dynamics, which we will use as a starting point for more advanced modeling later.Installation and setupBioNetGen features a convenient interface via the Microsoft Visual Studio Code editor. To see how to set up and install the necessary software, please visit their setup and installation page.Starting with Ligand-Receptor DynamicsIn this tutorial, we will build our model from scratch. If you like instead, you can download the completed simulation file here:ligand_receptor.bnglIn our system, there are only two types of molecules: the ligand (L), and the receptor (T). (The receptor is in fact a receptor complex because it is attached to additional molecules, which we will elaborate on later). The ligand can bind to the receptor, forming an intermediate, and the complex can also dissociate. We write this reaction as L + T &lt;-&gt; L.T, where the formation of the intermediate is called the forward reaction, and the dissociation is called the reverse reaction.In our system, which starts with a quantity of free ligands and receptors, the numbers of these free molecules should drop quickly, because free ligands and free receptors are readily to meet each other. After a while, there will be more L.T in the system and therefore more dissociation; at the same time, because free L and T are less abundant, less binding happens. The system will gradually reach a steady state where the rate of L and T binding equilibrates with L.T dissociation.We will simulate reaching this steady state, which means that we will need to know the following two parameters:  The rate of the forward reaction: k_lr_bind [L][T], where k_lr_bind is the rate constant.  The rate of the reverse reaction:  k_lr_dis[L.T], where k_lr_dis is the rate constant.Equilibrium is reached when k_lr_bind [L][T] = k_lr_dis[L.T]. Our goal in this tutorial is to use BioNetGen to determine this equilibrium in molecule concentrations as a proof of concept.To use BioNetGen, first create a folder called EColiSimulations in an appropriate location on your computer. Next, open VSCode, and select File &gt; Open Folder, and open the folder you just created.In this directory, create a new file by selecting the File &gt; New Text File. Save the file as  ligand_receptor.bngl. Now you should be able to start building our model in this file using the code that follows in this tutorial.Specifying molecule typesWe will specify everything needed for this tutorial, but if you are interested, reference BioNetGen documentation can be found here.To specify our model, add begin model and end model. 
Everything below regarding the specification of the model will go between these two lines.We will have two molecules corresponding to the ligand and receptor L and T, which we denote L(t) and T(l), respectively. The (t) specifies that molecule L contains a binding site for T, and the (l) specifies that T contains a binding site for L. We will use these components later when specifying reactions.We will add the ligand and receptor molecules to our model under a molecule types section.begin modelbegin molecule types	L(t)	T(l)end molecule typesend modelSpecifying reaction rules and observablesAs discussed in the main text, the ligand-receptor simulation will only need to apply a single bi-directional reaction. To represent the bi-directional reaction A + B &lt;-&gt; C with forward rate k1 and reverse rate k2, we would write A + B &lt;-&gt; C k1, k2.Our model consists of one bidirectional reaction and will have a single rule. The left side of this rule will be L(t) + T(l); by specifying L(t) and T(l), we indicate to BioNetGen that we are only interested in unbound ligand and receptor molecules. If we had wanted to select any ligand molecule, then we would have used L + T.On the right side of the rule, we will have L(t!1).T(l!1), which indicates the formation of the complex. In BioNetGen, the symbol ! indicates formation of a bond, and a unique character specifies the possible location of this bond. In our case, we use the character 1, so that the bond is represented by !1. The symbol . is used to indicate that the two molecules are joined into a complex.Since the reaction is bidirectional, we will use k_lr_bind and k_lr_dis to denote the rates of the forward and reverse reactions, respectively. (We will specify values for these parameters later.)As a result, this reaction is shown below. We name our rule specifying the ligand-receptor reaction LR.LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disbegin reaction rules	LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disend reaction rulesOnce we have specified reactions, we can define the molecules whose concentrations we are interested in tracking. These molecules are added to an observables section.begin observables  Molecules free_ligand L(t)  Molecules bound_ligand L(t!1).T(l!1)  Molecules free_receptor T(l)end observablesInitializing unbound molecule countsNext, we need to specify a variable indicating the number of molecules with which we would like to initialize our simulation. We place these molecules within a species section. We will place L0 unbound L molecules and T0 unbound T molecules at the start of the simulation; we will set these parameter values later.Note that we do not specify an initial number of bound L.T complexes, meaning that the initial concentration of these complexes will be equal to zero.begin species	L(t) L0	T(l) T0end speciesSpecifying parametersNow we will declare all the parameters we introduced in the above sections. We will start by setting L0, the initial number of ligand molecules, to 10,000, and T0, the initial number of receptors, to 7000. It remains to set the reaction rates for the forward and reverse reactions.BioNetGen is unitless, but for simplicity, we will assume that all concentrations are measured in the number of molecules per cell. Reaction rate constants are conventionally expressed in terms of molar concentration (M), where one mole is Avogadro’s number of molecules, approximately 6.02 · 10^23.Because of the differing units (molecules per cell versus molar concentration), we need to do some unit conversion here. 
The volume of an E. coli cell is approximately 1 µm^3, and so 1 mole per liter corresponds to 1 mole per 10^15 µm^3, or 6.02 · 10^8 molecules per cell. For bimolecular reactions, the rate constant has units of M^-1·s^-1, and we divide by NaV to convert to (molecules/µm^3)^-1·s^-1. For monomolecular reactions, the rate constant has units of s^-1, so no unit conversion is required.Although the specific numbers of cellular components vary from bacterium to bacterium, the components of the chemotaxis pathway occur in relatively constant ratios. For all the simulations in this tutorial, we assign the initial number for each molecule and each reaction rate by first deciding a reasonable range based on in vivo quantities [1, 2, 3]. Our parameters are summarized below.begin parameters	NaV 6.02e8    #Unit conversion M -&gt; #/µm^3	L0 1e4        #number of ligand molecules	T0 7000       #number of receptor complexes	k_lr_bind 8.8e6/NaV   #ligand-receptor binding	k_lr_dis 35           #ligand-receptor dissociationend parametersNote: The parameters section has to appear before the reaction rules section.If you save your file, then you should see a “contact map” in the upper right corner of the window indicating the potential binding of L and T. This contact map is currently very simplistic, but for more complicated simulations it can help visualize the interaction of species in the system.Specifying simulation commandsWe are now ready to run our simulation. At the bottom of the model specification (i.e., after end model), we will add a generate_network and simulate command. The simulate command will take three parameters, which we specify below.Method. We will use method=&gt;"ssa" throughout these tutorials, which indicates that we are using the SSA (Gillespie) algorithm that was described in the main text. BioNetGen also includes the parameters method=&gt;"nf" (network-free) and method=&gt;"ode" (ordinary differential equations) that you can try. See the following article for more details if you are interested in these two approaches: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5079481.Time span. t_end, the simulation duration. BioNetGen simulation time is unitless; for simplicity, we assume our time unit is the second.Number of Steps. n_steps tells the program how many time points to break the simulation into when measuring the concentrations of our observables.	generate_network({overwrite=&gt;1})	simulate({method=&gt;"ssa", t_end=&gt;1, n_steps=&gt;100})The following code contains our complete simulation, which you can also download here:ligand_receptor.bngl.begin modelbegin molecule types	L(t)	T(l)end molecule typesbegin observables	Molecules free_ligand L(t)	Molecules bound_ligand L(t!1).T(l!1)	Molecules free_receptor T(l)end observablesbegin parameters	NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	L0 1e4        #number of ligand molecules	T0 7000       #number of receptor complexes	k_lr_bind 8.8e6/NaV2   #ligand-receptor binding	k_lr_dis 35            #ligand-receptor dissociationend parametersbegin reaction rules	LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disend reaction rulesbegin species	L(t) L0	T(l) T0end speciesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;1, n_steps=&gt;100})STOP: Based on our results from calculating steady state concentrations by hand in the main text, predict how the concentrations will change and what the equilibrium concentrations will be.Running our simulationWe are now ready to run our simulation. 
To do so, click the Run BNG button on the top right of your VS Code window (see screenshot below). You should see a terminal appear on the bottom of the window, which will show you the progress of the simulation. The output of the program is stored in a folder located in the working directory, named ligand_receptor.To visualize the results, open the file ligand_receptor.gdat stored inside the folder. With this file open, click on the CLI Plotting button on the top right corner of the VS Code window.It is also possible to create an interactive plot from the results. Open the file ligand_receptor.gdat, and click on the “built-in plotting” button located next to the “CLI Plotting” button you used to create the figure. The following interactive plot will be created:Is the result you obtain what you expected? In the main text, we will return to this question and then learn more about the details of bacterial chemotaxis in order to expand our BioNetGen model into one that fully reflects these details.Return to main text            [1] Li M, Hazelbauer GL. 2004. Cellular stoichiometry of the components of the chemotaxis signaling complex. Journal of Bacteriology. Available online.              [2] Spiro PA, Parkinson JS, Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Proceedings of the National Academy of Sciences 94:7263-7268. Available online.              [3] Stock J, Lukat GS. 1991. Intracellular signal transduction networks. Annual Review of Biophysics and Biophysical Chemistry. Available online.      "
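As a check on the STOP question above, the equilibrium can be computed directly from the rate balance k_lr_bind[L][T] = k_lr_dis[L.T]. Writing x for the number of bound complexes, [L] = L0 − x and [T] = T0 − x, which gives a quadratic in x. A minimal Python sketch using the parameter values from this tutorial (a standalone check, not part of the BNGL file):

```python
# Solve k_bind*(L0 - x)*(T0 - x) = k_dis*x for the number of bound complexes x
# at equilibrium, using the parameter values from this tutorial.
import math

NaV = 6.02e8
L0, T0 = 1e4, 7000
k_bind = 8.8e6 / NaV  # (molecules/um^3)^-1 s^-1 after the unit conversion above
k_dis = 35.0          # s^-1

# Rearranged: x^2 - (L0 + T0 + K)*x + L0*T0 = 0, where K = k_dis / k_bind
K = k_dis / k_bind
b = L0 + T0 + K
x = (b - math.sqrt(b * b - 4 * L0 * T0)) / 2  # smaller root is the physical one

print(f"bound L.T ~ {x:.0f}, free L ~ {L0 - x:.0f}, free T ~ {T0 - x:.0f}")
```

Under these assumptions, x comes out to roughly 4,800 bound complexes (leaving roughly 5,200 free ligands and 2,200 free receptors), which you can compare against the plateaus in your plot.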
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Finding Local Differences in the SARS-CoV and SARS-CoV-2 Spike Protein Structures",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_multiseq",
        "date"     : "",
        "content"  : "In this tutorial, we will get started with VMD and then calculate Qres between the SARS-CoV-2 RBD (PDB entry: 6vw1) and SARS-CoV RBD (PDB entry: 2ajf) using the VMD plugin Multiseq. By locating regions with low Qres, we can hopefully identify regions of structural differences between the two RBDs.Multiseq aligns two protein structures using a tool called Structural Alignment of Multiple Proteins (STAMP). Much like the Kabsch algorithm considered in part 1 of the module, STAMP minimizes the distance between alpha carbons of the aligned residues for each protein or molecule by applying rotations and translations. If the structures do not have common structures, then STAMP will fail. For more details on the algorithm used by STAMP, click here.Note: As of the current time, the STAMP alignment step used by Multiseq is not working for most users, especially Windows users. We will update this module when the software is fixed.Getting startedFor this tutorial, first download VMD. Throughout this tutorial, the program may prompt you to download additional protein database information, which you should accept.We will need to download the .pdb files for 6vw1 and 2ajf. Visit the 6vw1 and 2ajf PDB pages. For each protein,  click Download Files and select PDB Format. The following figure shows this for 6vw1.Aligning the RBD regions of two spike proteinsNext, launch VMD, which will open three windows. We will not use VMD.exe, the console window, in this tutorial. We will load molecules and change visualizations in VMD Main. Finally, we will use OpenGL Display to display our visualizations.We will first load the SARS-CoV-2 RBD (6vw1) into VMD. In VMD Main, go to File &gt; New Molecule. Click Browse, select your downloaded file (6vw1.pdb) and click Load.The molecule should now be listed in VMD Main, with its visualization in OpenGL Display.In the OpenGL Display window, you can click and drag the molecule to change its orientation. Pressing ‘r’ on your keyboard allows you to rotate the molecule, pressing ‘t’ allows you to translate the molecule, and pressing ‘s’ allows you to enlarge or shrink the molecule (or you can use your mouse’s scroll wheel). Note that left click and right click have different actions.We now will need to load the SARS-CoV RBD (2ajf). Repeat the above steps for 2ajf.pdb.After both molecules are loaded into VMD, start up Multiseq by clicking on Extensions &gt; Analysis &gt; Multiseq.You will see all the chains listed out per file. Both PDB files contain two biological assemblies of the structure. The first is made up of Chain A (ACE2) and Chain E (RBD), and the second is Chain B (ACE2) and Chain F (RBD). Because Chain A is identical to Chain B, and Chain E is identical to Chain F, we only need to work with one assembly. (We will use the second assembly.)Because we only want to compare the RBD, we will only keep chain F of each structure. To remove the other chains, select the chain and click Edit &gt; Cut.Click Tools &gt; Stamp Structural Alignment, and a new window will open up.Keep all the values and click OK; once you have done so, the RBD regions will have been aligned.Visualizing a structural alignmentNow that we have aligned the two RBD regions, we would like to compare their Qres values over the entire RBD. 
To see a coloring of the protein alignment based on Qres, click View &gt; Coloring &gt; Qres.Blue indicates a high value of Qres, meaning that the protein structures are similar at this position; red indicates low Qres and dissimilar protein structures.The OpenGL Display window will now color the superimposed structures according to the value of Qres.We are looking for regions of consecutive amino acids having low Qres, which correspond to locations in which the coronavirus RBDs differ structurally. You may like to explore the alignments yourself to look for regions of interest before we head back to the main text and discuss our results.Return to main text"
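Incidentally, if you would rather fetch the two structures from a script than click through the PDB website, RCSB serves .pdb files from a stable download URL. A minimal Python sketch (the local filenames are arbitrary):

```python
# Fetch the two PDB entries used in this tutorial directly from RCSB.
from urllib.request import urlretrieve

for pdb_id in ["6vw1", "2ajf"]:
    url = "https://files.rcsb.org/download/" + pdb_id.upper() + ".pdb"
    urlretrieve(url, pdb_id + ".pdb")  # save as 6vw1.pdb / 2ajf.pdb
    print("saved", pdb_id + ".pdb")
```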
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Comparing Simple Regulation to Negative Autoregulation",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_nar",
        "date"     : "",
        "content"  : "Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender. This tutorial works with MCell3, CellBlender 3.5.1, and Blender 2.79. Please see a previous tutorial for a link to download these versions.Implementing simple regulation in CellBlenderIn this tutorial, we will compare simple against negative autoregulation using a particle-based simulation in CellBlender. We will start with simple regulation; if you followed the prologue, then load your CellBlender_Tutorial_Template.blend file; otherwise, follow the steps indicated in the prologue’s Random Walk Tutorial to produce this file. Then, save a copy of this file as NAR_comparison.blend. You may also download the completed tutorial files here.Then go to CellBlender &gt; Molecules and create the following molecules:  Click the + button.  Select a color (such as yellow).  Name the molecule Y1.  Select the molecule type as Surface Molecule.  Add a diffusion constant of 1e-6.  Up the scale factor to 5 (click and type “5” or use the arrows).Repeat the above steps as needed to make sure that both of the following molecules are entered with the following parameters.            Molecule Name      Molecule Type      Diffusion Constant      Scale Factor                  Y1      Surface      1e-6      5              X1      Surface      1e-6      1      Now go to CellBlender &gt; Molecule Placement to set the following sites to release our molecules:  Click the + button.  Select or type in the molecule X1.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 300.Finally, we set reactions. Go to CellBlender &gt; Reactions and define the following reactions:  Click the + button.  Under reactants, type X1’ (note the apostrophe).  Under products, type X1’ + Y1’.  Set the forward rate as 2e2.Repeat the above steps as needed to ensure the following reactions are present.            Reactants      Products      Forward Rate                  X1’      X1’ + Y1’      4e2              Y1’      NULL      4e2      Go to CellBlender &gt; Plot Output Settings to ensure that we will be able to plot the concentrations of our particles over time.  Click the + button.  Set the molecule name as Y1.  Ensure World is selected.  Ensure Java Plotter is selected.  Ensure One Page, Multiple Plots is selected.  Ensure Molecule Colors is selected.We are ready to run our simulation! Visit CellBlender &gt; Run Simulation and select the following options:  Set the number of iterations to 20000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Once the simulation has run, click CellBlender &gt; Reload Visualization Data to visualize the outcome.You have the option of watching the animation within the Blender window by clicking the play button at the bottom of the screen.Now return to CellBlender &gt; Plot Output Settings and scroll to the bottom to click Plot.You should be able to see Y reach a steady state, at which the number of particles essentially levels off subject to some noise.Save your .blend file.Adding negative auto-regulation to the simulationNow that we have simulated simple regulation, we will implement negative autoregulation in CellBlender to compare how this system reaches steady state compared to the simple regulation system.Go to CellBlender &gt; Molecules and create the following molecules:  Click the + button.  Select a color (such as yellow).  Name the molecule Y2.  Select the molecule type as Surface Molecule.  
Add a diffusion constant of 1e-6.  Up the scale factor to 5 (click and type “5” or use the arrows).Repeat the above steps to make sure that we have all of the following molecules (X1 and Y1 are inherited from the simple regulation simulation).            Molecule Name      Molecule Type      Diffusion Constant      Scale Factor                  Y1      Surface      1e-6      5              X1      Surface      1e-6      1              Y2      Surface      1e-6      5              X2      Surface      1e-6      1      Now go to CellBlender &gt; Molecule Placement to set the following molecule release sites:  Click the + button.  Select or type in the molecule X2.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 300.You should now have the following release sites.            Molecule Name      Object/Region      Quantity to Release                  X1      Plane      300              X2      Plane      300      Next go to CellBlender &gt; Reactions to create the following reactions:  Click the + button.  Under reactants, type X2’ (the apostrophe is important).  Under products, type X2’ + Y2’.  Set the forward rate as 4e2.Repeat the above steps as needed to ensure that you have the following reactions.            Reactants      Products      Forward Rate                  X1’      X1’ + Y1’      4e2              X2’      X2’ + Y2’      4e2              Y1’      NULL      4e2              Y2’      NULL      4e2              Y2’ + Y2’      Y2’      4e2      Go to CellBlender &gt; Plot Output Settings to set up a plot as follows:  Click the + button.  Set the molecule name as Y2.  Ensure World is selected.  Ensure Java Plotter is selected.  Ensure One Page, Multiple Plots is selected.  Ensure Molecule Colors is selected.You should now have both Y1 and Y2 plotted.            Molecule Name      Selected Region                  Y1      World              Y2      World      We are now ready to run the simulation comparing simple regulation and negative autoregulation. To do so, go to CellBlender &gt; Run Simulation and do the following:  Set the number of iterations to 20000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Click CellBlender &gt; Reload Visualization Data to visualize the result of the simulation.You have the option of watching the animation within the Blender window by clicking the play button at the bottom of the screen.Now return to CellBlender &gt; Plot Output Settings and scroll to the bottom to click Plot.A plot should appear in which the plot of Y over time assuming simple regulation is shown in red, and the plot of Y if negatively autoregulated is shown in yellow.Save your file.Comparing simple regulation and negative autoregulationSTOP: Now that you have run the simulation comparing simple regulation and negative autoregulation, are the plots of Y for the two simulations what you would expect? Why or why not?If you find the outcome of the simulation in this tutorial confusing, don’t be concerned. In the main text, we will interpret this outcome and see if it allows us to start making conclusions about why negative autoregulation has evolved, or if we will need to further tweak our model.Return to main text"
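If the outcome is surprising, a deterministic caricature may help. The sketch below integrates ODE analogues of the two circuits; the rate values are illustrative stand-ins, not the CellBlender parameters, and the particle simulation adds noise on top of this behavior.

```python
# Deterministic (ODE) analogue of the two CellBlender circuits:
#   simple regulation:       dY1/dt = beta - alpha*Y1
#   negative autoregulation: dY2/dt = beta - alpha*Y2 - gamma*Y2^2
# The rate values below are illustrative, not the CellBlender parameters.
beta, alpha, gamma = 100.0, 1.0, 0.05
dt, steps = 0.001, 10000
y1 = y2 = 0.0
for _ in range(steps):
    y1 += (beta - alpha * y1) * dt                    # simple regulation
    y2 += (beta - alpha * y2 - gamma * y2 * y2) * dt  # negative autoregulation
print("steady states:", round(y1, 1), round(y2, 1))
```

Because the autoregulated circuit removes Y faster as Y accumulates, it settles at a lower steady state than the simply regulated circuit with the same production rate; the next tutorial makes the comparison fair by matching the two steady states.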
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Ensuring a mathematically controlled simulation for comparing simple regulation to negative autoregulation",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_nar_mathematically_controlled",
        "date"     : "",
        "content"  : "Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender. This tutorial works with MCell3, CellBlender 3.5.1, and Blender 2.79. Please see a previous tutorial for a link to download these versions.In this tutorial, we will use CellBlender to adapt our simulation from the tutorial on negative autoregulation into a mathematically controlled simulation.First, open the file NAR_comparison.blend from the negative autoregulation tutorial and save a copy of the file as NAR_comparison_equal.blend. You may also download the completed tutorial files here.Now go to CellBlender &gt; Reactions to scale up the simple regulation reaction in the negative autoregulation simulation as follows: for the reaction X2’ -&gt; X2’ + Y2’,  change the forward rate from 4e2 to 4e3.Next go to CellBlender &gt; Run Simulation and ensure that the following options are selected:  Set the number of iterations to 20000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Click CellBlender &gt; Reload Visualization Data. You have the option of watching the animation within the Blender window by clicking the play button at the bottom of the screen.Now go back to CellBlender &gt; Plot Output Settings and scroll to the bottom to click Plot; this should produce a plot. How does your plotSave your file before returning to the main text, where we will interpret the plot produced to see if we were able to obtain a mathematically controlled simulation and then interpret the result of this simulation from an evolutionary perspective.Return to main text"
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Segmenting Nuclei from Cellular Images",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/tutorial_nuclear_segmentation",
        "date"     : "",
        "content"  : "Installing R and RStudioTo run our segmentation pipeline, we will use R, a free programming language that is popular among data scientists across disciplines. We will also use RStudio, an integrated development environment that makes working with R easy.You can download R and RStudio from their respective home sites or follow the instructions at Hands-On Programming with R.Obtaining the WBC Image DataThe raw images and files that we will need to run our analysis are contained in a WBC_PCAPipeline folder that can be downloaded here as a .zip. Extract this file, move the resulting folder onto your desktop, and verify that it has the following contents:WBC_PCAPipeline	Data		RawImgs			BloodImage_00001.jpg			·			·			·			BloodImage_00410.jpg		WBC_Labels.csv	Step1_Segmentation		WBC_imgSeg.R	Step2_Binarization		WBC_imgBin.m	Step3_PCAModel		WBC_PCAModel.m	Step4_ShapeSpaceVisualization		WBC_SS_CellClass.py		WBC_SS_CellType.py	Step5_Classification	README.pdfNote: Asking you to place this directory of files onto your desktop is unconventional. If you place it elsewhere, then you will have to manually change all file paths in the tutorials that follow in order to point to the appropriate directory. However, we have noticed occasional software glitches with using setwd() or pwd() if not specifying a universal location.You may like to take a look at the Data folder, which contains the WBC images that we will work with in RawImgs. The other folders contain files that we will use to run different aspects of our analysis, starting with segmentation of the nuclei from the WBC images.Segmenting Nuclei from WBC ImagesIn the main text, we stated that we would segment the nuclei from our WBC images by binarizing the image based on color. The nucleus shows up as bluish, so that our idea is to color a pixel white if it has above a certain threshold level of blue, below a threshold amount of red, and below a threshold amount of green.Open RStudio, navigate to File --&gt; Open File, and find Desktop/WBC_PCAPipeline/Step1_Segmentation/WBC_imgSeg.R. You should see the WBC_imgSeg.R file appear on the left side of the RStudio window.The first few lines of WBC_imgSeg.R refer to a collection of three packages (or libraries) that we need to install in order to run segmentation pipeline. Two of these packages (jpeg and tiff) are contained in R, and the third (EBImage) is installed from the Bioconductor project as part of the BiocManager package. These package installations correspond to the following lines of our R file.install.packages("jpeg")install.packages("tiff")install.packages("BiocManager")BiocManager::install("EBImage")Place your cursor on the first line of the file, and click Run, which will install the jpeg package along with any packages upon which it depends.You can then click Run three more times to install each of the other three packages.Note: Should you be asked in the RStudio console about upgrading dependencies during the EBImage library installation, type a and hit enter. Also, after you run this file once, should you decide to run this file again, you will not need to run the package installations and can comment them out using #.Now that we have installed the required packages, we indicate to R that we want to use each of the three packages that we just installed in this file. 
Run each of the following three lines.# Required Librarieslibrary("EBImage")library("jpeg")library("tiff")Next, we run the following lines, which initialize a counter i that we will use to find an unused index when naming the output directories.# dir Counteri = 1We will now run commands to set paths for raw images and segmented images. segImgs will contain all of the segmented nuclei images in which the WBC nucleus is white and the rest of the image is black. colImgs will contain all of the segmented nucleus images in which the nucleus retains its original color (and the background of the image is black). Finally, BWImgs will store binarized versions of our segmented images (more on this later).# Set path for raw image filespath="~/Desktop/WBC_PCAPipeline/Data/"rawImgs=paste(path, "RawImgs/", sep="")# Set up directory path for segmented imagessegImgs=paste(path, "SegImgs_", sep="")colImgs=paste(path, "ColNuc_", sep="")bwImgs=paste(path, "BWImgs_", sep="")Next, we run commands to set up directories and print some messages to the console regarding the creation of these directories.# Check if unique seg directory exists, otherwise create onewhile (file.exists(paste(segImgs, toString(i), sep=""))) {  i = i + 1}print(noquote(paste("Creating", paste(segImgs, toString(i), sep=""), "directory for segmented images.")))dir.create(paste(segImgs, toString(i), sep=""), showWarnings = FALSE);print(noquote(paste("Creating", paste(bwImgs, toString(i), sep=""), "directory for binarized images.")))dir.create(paste(bwImgs, toString(i), sep=""), showWarnings = FALSE)print(noquote(paste("Creating", paste(colImgs, toString(i), sep=""), "directory for nucleus in color images.")))dir.create(paste(colImgs, toString(i), sep=""), showWarnings = FALSE)outDir=paste(segImgs, toString(i), sep="")setwd(rawImgs)# Gather all files within the directory aboveall.files &lt;- list.files()my.files &lt;- grep("*.jpg", all.files, value=T)print(noquote("Starting nucleus segmentation..."))Finally, the engine of our work is a loop that processes and segments every image individually according to the thresholding that we discussed above. Note that in the following code, a pixel is only retained if its red value is less than 65%, its green value is less than 60%, and its blue value is above 59.75% (see the values of r_threshold, g_threshold, and b_threshold).We should only need to run the first line of the following code, since R will automatically perform everything inside this “for loop” for as many files as are in our dataset. You should not feel obligated to consult the following lines unless you are interested. The only thing that this code does that we have not discussed is that after segmentation, it identifies contiguous objects in the image and removes any objects whose area is smaller than some threshold, which allows us to remove any small artifacts from the image.# Loop through each file and process each image individuallyfor (i in my.files) {  print(noquote(paste("Segmenting nucleus from file", i)))  # Read the image, change to its directory  nuc = readImage(paste(rawImgs, i, sep=""))  # Each nuclear stain has low red, low green and high blue.  # Need to invert the red and green channels, and then threshold according to  # the above criteria  nuc_r = channel(nuc, 'r')  nuc_g = channel(nuc, 'g')  nuc_b = channel(nuc, 'b')  # Assigned thresholds for low red, low green, and high blue.  
r_threshold = 0.65  g_threshold = 0.60  b_threshold = 0.5975  # Apply the thresholds accordingly  nuc_rTH = nuc_r &lt; r_threshold  nuc_gTH = nuc_g &lt; g_threshold  nuc_bTH = nuc_b &gt; b_threshold  nucleusComp = nuc_rTH &amp; nuc_gTH &amp; nuc_bTH  nucleusBW = bwlabel(fillHull(nucleusComp))  # Compute features for objects that have made the threshold boundaries  features = computeFeatures.shape(nucleusBW)  # Assigned area threshold  area_threshold &lt;- 1500  # Find all features that do not meet the area threshold  indices = which(features &lt; area_threshold)  nucleusFin = rmObjects(nucleusBW, indices)  newFeatures = computeFeatures.shape(nucleusFin)  # Write final nucleus image to disk  filename = paste(outDir, i, sep="/")  writeImage(nucleusFin, filename)}Finally, we print a message announcing that the run is complete. If we see this message in the console, then we know that the pipeline finished.print(noquote("DONE!"))Note: If you run the file multiple times, three directories are created each time within the Data folder with the form of SegImgs_i, ColNuc_i, and BWImgs_i, where i is an integer. The images are only segmented into the most recently created directories (those with the largest value of i). Should you run into trouble and need to run this file multiple times, ensure that future file paths are pointing to the right folders!After we have run our R file, you will notice the creation of three directories of the form: SegImgs_1, ColNuc_1, and BWImgs_1 within the Data folder. If the run completed correctly, then you should see the segmented images in SegImgs_1, like the image shown in the figure below. However, these images are not technically binarized because they exist in grayscale, in which each pixel receives a value between 0 (black) and 255 (white).The greyscale segmented nucleus of BloodImage_00001.jpg.Binarizing Segmented ImagesWe have successfully segmented our images, but we would like to ensure that these images are truly binarized, so that each pixel is either 0 (black) or 1 (white). Furthermore, the CellOrganizer approach that we will consider in a future tutorial requires all images to be in TIFF format, and this step will handle that file conversion as well by running a MATLAB pipeline.Note: The current version of CellOrganizer that we will use in future tutorials is a free distribution provided as an add-on to MATLAB, which is paid software. We felt that MATLAB was the easiest way to run the binarization pipeline below as well. We are in the process of investigating a way to run all of the tutorials in this module without needing paid software.First, you will need the latest version of MATLAB. Then, open MATLAB and run the following commands in the MATLAB command window:clearclccd ~/Desktop/WBC_PCAPipeline/Step2_BinarizationWBC_imgBinAs a result, the BWImgs_1 directory will now contain binarized TIFF versions of the segmented images. Furthermore, the ColNuc_1 directory will now contain TIFF versions of the segmented images like the one below, such that the nucleus is in color and the background is black. We will not be using these images in future tutorials, but they provide an indication that our segmentation was mostly successful.Nuclear segmentation of BloodImage_00001.jpg with color retained in the nucleus.STOP: Before we return to the main text, try running the segmentation pipeline on a few different values of r_threshold, g_threshold, and b_threshold to see how they change the segmentation results. 
How could we quantify whether one collection of parameters is better than another? (Make sure that your final run of the R pipeline uses the threshold values given above so that your results will match those in future tutorials.)Return to main text"
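For readers who want to experiment outside of R, here is a NumPy sketch of the same RGB thresholding, along with one possible answer to the STOP question: scoring a parameter choice by its Dice overlap with a ground-truth mask. The truth file named here is hypothetical; no such file ships with the tutorial data.

```python
import numpy as np
from matplotlib.image import imread

img = imread("BloodImage_00001.jpg") / 255.0   # JPEGs load as uint8; rescale to [0, 1]
r, g, b = img[..., 0], img[..., 1], img[..., 2]
mask = (r < 0.65) & (g < 0.60) & (b > 0.5975)  # the same thresholds as WBC_imgSeg.R

# One way to score a parameter choice: Dice overlap with a hand-labeled mask.
# "BloodImage_00001_truth.npy" is a hypothetical 2-D boolean array.
truth = np.load("BloodImage_00001_truth.npy")
dice = 2 * (mask & truth).sum() / (mask.sum() + truth.sum())
print(f"Dice coefficient: {dice:.3f}")         # 1.0 would mean perfect agreement
```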
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Implementing the Repressilator",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_oscillators",
        "date"     : "",
        "content"  : "Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender. This tutorial works with MCell3, CellBlender 3.5.1, and Blender 2.79. Please see a previous tutorial for a link to download these versions.In this tutorial, we will use CellBlender to build a particle-based simulation implementing a repressilator. First, load the CellBlender_Tutorial_Template.blend file from the Random Walk Tutorial and save a copy of the file as repressilator.blend. You may also download the completed tutorial file here.Then go to CellBlender &gt; Molecules and create the following molecules:  Click the + button.  Select a color (such as yellow).  Name the molecule Y.  Select the molecule type as Surface Molecule.  Add a diffusion constant of 1e-6.  Up the scale factor to 5 (click and type “5” or use the arrows).Repeat the above steps to make sure that the following molecules are all entered with the appropriate parameters.            Molecule Name      Molecule Type      Diffusion Constant      Scale Factor                  X      Surface      4e-5      5              Y      Surface      4e-5      5              Z      Surface      4e-5      5              HiddenX      Surface      3e-6      3              HiddenY      Surface      3e-6      3              HiddenZ      Surface      3e-6      3              HiddenX_off      Surface      1e-6      3              HiddenY_off      Surface      1e-6      3              HiddenZ_off      Surface      1e-6      3      Now go to CellBlender &gt; Molecule Placement to establish molecule release sites by following these steps:  Click the + button.  Select or type in the molecule X.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 150.Repeat the above steps to make sure the following molecules are entered with the appropriate parameters as shown below.            Molecule Name      Object/Region      Quantity to Release                  X      Plane      150              HiddenX      Plane      100              HiddenY      Plane      100              HiddenZ      Plane      100      Next go to CellBlender &gt; Reactions to create the following reactions:  Click the + button.  Under reactants, type HiddenX’ (note the apostrophe).  Under products, type HiddenX’ + X’.  Set the forward rate as 2e3.Repeat the above steps for the following reactions, ensuring that you have the appropriate parameters for each reaction.Note: Some molecules require an apostrophe or a comma. This represents the orientation of the molecule in space and is very important to the reactions!            Reactants      Products      Forward Rate                  HiddenX’      HiddenX’ + X’      2e3              HiddenY’      HiddenY’ +Y’      2e3              HiddenZ’      HiddenZ’ + Z’      2e3              X’ + HiddenY’      HiddenY_off’ + X,      6e2              Y’ + HiddenZ’      HiddenZ_off’ + Y,      6e2              Z’ + HiddenX’      HiddenX_off’ + Z,      6e2              HiddenX_off’      HiddenX’      6e2              HiddenY_off’      HiddenY’      6e2              HiddenZ_off’      HiddenZ’      6e2              X’      NULL      6e2              Y’      NULL      6e2              Z’      NULL      6e2              X,      X’      2e2              Y,      Y’      2e2              Z,      Z’      2e2      Go to CellBlender &gt; Plot Output Settings to build a plot as follows:  Click the + button.  Set the molecule name as X.  Ensure World is selected.  
Ensure Java Plotter is selected.  Ensure One Page, Multiple Plots is selected.  Ensure Molecule Colors is selected.Repeat the above steps for the following molecules.            Molecule Name      Selected Region                  X      World              Y      World              Z      World      We are now ready to run our simulation. Go to CellBlender &gt; Run Simulation and select the following options:  Set the number of iterations to 120000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Once the simulation has run, visualize the results of the simulation with CellBlender &gt; Reload Visualization Data.Now go back to CellBlender &gt; Plot Output Settings and scroll to the bottom to click Plot.Does the plot that you obtain look like a biological oscillator? As we return to the main text, we will interpret this plot and then see what will happen if we suddenly shift the concentration of one of the particles. Will the system still retain its oscillations?Return to main text"
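Before interpreting the CellBlender plot, it may help to see the qualitative behavior this network is designed to produce. The sketch below integrates the standard deterministic repressilator equations; it is a different, ODE-based model than the particle simulation above, and the parameter values are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

alpha, n = 50.0, 3.0        # production strength and Hill coefficient (illustrative)
dt, steps = 0.01, 20000
x, y, z = 1.0, 0.5, 0.25    # asymmetric start so the oscillation can take hold
trace = np.zeros((steps, 3))
for t in range(steps):
    trace[t] = x, y, z
    # each protein is produced under repression by the previous one and decays
    x, y, z = (x + dt * (alpha / (1 + z**n) - x),
               y + dt * (alpha / (1 + x**n) - y),
               z + dt * (alpha / (1 + y**n) - z))

plt.plot(np.arange(steps) * dt, trace)
plt.xlabel("time")
plt.ylabel("concentration")
plt.legend(["X", "Y", "Z"])
plt.show()
```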
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Perturbing the Repressilator",
        "category" : "",
        "tags"     : "",
        "url"      : "/motifs/tutorial_perturb",
        "date"     : "",
        "content"  : "In this tutorial, we will see what happens when we make a sudden change to the concentration of one of the repressilator particles in the middle of the simulation. This is difficult to do with CellBlender, and so we will instead use this opportunity to transition to a “particle-free” tool called NFSim that does have the desired functionality. We will say much more about particle-free modeling, in which we do not have to track the movements of individual particles to track their concentrations, in a future module.First, you will need to install NFSim and a program called RuleBender, which we will use as a GUI for NFSim. Those two programs can be installed here. You may also download the completed tutorial file here.We will first build a simulation of the repressilator that we will perturb later. Assuming you have installed RuleBender, open the RuleBender program and select File &gt; New BioNetGen Project.Select blank_file.bngl and name your project oscillators.Note: Occasionally the following error will pop up to inform the user: “There was a failure during the copy of the sample”. The folder will be created, but no files will be loaded. Select File &gt; New &gt; File to create a new blank file.Rename your file oscillator_copy.bngl and double-click the file in the navigator to open the editor window. Once in the editor window, add the following parameters:begin parameters r1 2e3 r2 6e2 r3 6e2 r4 2e2 r5 6e2end parametersNext, add the molecules used as follows:begin molecule types x(Y~U~P) y(Y~U~P) z(Y~U~P) hx() hy() hz() hx_off() hy_off() hz_off() null()end molecule typesNext, specify the quantities of each molecule at the start of the simulation:begin species x(Y~U) 150 y(Y~U) 0 z(Y~U) 0 hx() 100 hy() 100 hz() 100 hx_off() 0 hy_off() 0 hz_off() 0 null() 0end speciesTo view a plot of the molecules after the simulation is complete, add the following code:begin observables Molecules X x() Molecules Y y() Molecules Z z()end observablesThe following rules and reaction parameters are the same reaction rules as used in the CellBlender tutorial on the repressilator.begin reaction rules # x copy hx() -&gt; hx() + x(Y~U) r1 x(Y~U) + hy() -&gt; hy_off() + x(Y~P) r2 hy_off() -&gt; hy() r3 x(Y~P) -&gt; x(Y~U) r4 x() -&gt; null() r5 # y copy hy() -&gt; hy() + y(Y~U) r1 y(Y~U) + hz() -&gt; hz_off() + y(Y~P) r2 hz_off() -&gt; hz() r3 y(Y~P) -&gt; y(Y~U) r4 y() -&gt; null() r5 # z copy hz() -&gt; hz() + z(Y~U) r1 z(Y~U) + hx() -&gt; hx_off() + z(Y~P) r2 hx_off() -&gt; hx() r3 z(Y~P) -&gt; z(Y~U) r4 z() -&gt; null() r5end reaction rulesFinally, specify the type of simulation and number of frames to run using the following code.# i.e. 12,000 frames at 1e-6 timestep on CellBlendersimulate_nf({t_end=&gt;.06,n_steps=&gt;60000});Then, save your file.On the right-hand side, click Simulation &gt; Run to run the simulation. After the simulation is complete, a new window will appear showing the plotted graph. As we can see, this appears to be the same behavior as the CellBlender plot but with a much cleaner pattern (this is because we do not have the noise incurred by having individual particles).We will now perturb the file and test the robustness of this oscillator model.In the Navigator window, right click oscillator_copy.bngl and copy the file. 
Paste a copy in the same folder and rename your file to oscillator_perturb.bngl.Add the following parameters to the parameters section of the file: # delay mechanic r6 1e7 r7 4e2 r8 1e3 r9 2e4 r10 1e3Then add the following molecules to the molecules section: # delay mechanic delay() a(Y~U~P) b() null()Add the following to species: # Delay mechanic delay() 100 a(Y~U) 1000 b() 0 null() 0Optional: add the following to observables: Molecules D delay() Molecules A a() Molecules B b()Finally, add the following to reaction rules.  These rules act as a delayed spike to the y() molecule. Once the delay() molecule has sufficiently decayed into null(),  the a() molecule will begin producing the b() molecule, which will in turn produce the y() molecule, disrupting our initial oscillations with a large influx of y(). # delay rules delay() + a(Y~P) -&gt; delay() + a(Y~U) r6 delay() -&gt; null() r7 a(Y~U) -&gt; a(Y~P) r8 a(Y~P) -&gt; b() r9 b() -&gt; y(Y~U) r10On the right side of the window, click Simulation &gt; Run. After the simulation is complete, a new window will appear showing the plotted graph.Can you break the oscillator model, or is it just too robust? We recommend playing around with the reaction rules for b(): which other species could it produce? You could also adjust the starting quantities for a(Y~U~P) or change the rate at which the delay() molecule decays.In the main text, we will discuss the robustness of the repressilator and make a larger point about robustness in biology before we complete our work in this module.Return to main text"
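The same perturbation experiment can be sketched on a deterministic analogue of the repressilator: inject a pulse of Y midway through the run and watch whether the limit cycle re-establishes itself. As before, this is an illustration rather than the NFSim model, and the pulse size is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

alpha, n, dt, steps = 50.0, 3.0, 0.01, 40000   # illustrative values
x, y, z = 1.0, 0.5, 0.25
trace = np.zeros((steps, 3))
for t in range(steps):
    if t == steps // 2:
        y += 50.0        # the "delayed spike": a sudden, arbitrary influx of Y
    trace[t] = x, y, z
    x, y, z = (x + dt * (alpha / (1 + z**n) - x),
               y + dt * (alpha / (1 + x**n) - y),
               z + dt * (alpha / (1 + y**n) - z))

plt.plot(np.arange(steps) * dt, trace)
plt.xlabel("time")
plt.ylabel("concentration")
plt.legend(["X", "Y", "Z"])
plt.show()
```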
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Adding Phosphorylation to our BioNetGen Model",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_phos",
        "date"     : "",
        "content"  : "In this tutorial, we will extend the BioNetGen model covered in the ligand-receptor tutorial to add the phosphorylation chemotaxis mechanisms described in the main text, shown in the figure reproduced below.To get started, open Visual Studio Code, and click File &gt; Open Folder.... Open the EColiSimulations folder from the previous tutorial. Create a copy of your file from the ligand-receptor tutorial and save it as phosphorylation.bngl. If you would rather not follow along below, you can download a completed BioNetGen file here:phosphorylation.bnglDefining moleculesSay that we wanted to specify all particles and the reactions involving them in the manner used up to this point in the book. We would need one particle type to represent MCP molecules, another particle type to represent ligands, and a third to represent bound complexes. A bound complex molecule binds with CheA and CheW and can be either phosphorylated or unphosphorylated, necessitating two different molecule types. In turn, CheY can be phosphorylated or unphosphorylated as well, requiring two more particles.Instead, the BioNetGen language will allow us to conceptualize this system much more concisely using rules that apply to particles that are in a variety of states (we will say more about the paradigm of using rules soon in the main text). The BioNetGen representation of the four particles in our model is shown below. The notation Phos~U~P indicates that a given molecule type can be either phosphorylated or unphosphorylated, so that we do not need multiple different expressions to represent the molecule. We also add molecules CheY(Phos~U~P) and CheZ().L(t)             #ligand moleculeT(l,Phos~U~P)    #receptor complexCheY(Phos~U~P)CheZ()Note: Be careful with the use of spaces; don’t put spaces after the comma in the specification of the receptor.)During this simulation, we are interested in tracking the concentration of phosphorylated CheY and CheA (receptor complex) along with the concentration of the ligand.begin observables	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules phosphorylated_CheA T(Phos~P)	Molecules bound_ligand L(t!1).T(l!1)end observablesDefining reactionsThe conciseness of BioNetGen’s molecule representation helps us represent our reactions concisely as well. We first reproduce the reversible binding and dissociation reaction from the previous lesson.LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_disNext, we represent the phosphorylation of the MCP complex. Recall that the phosphorylation of CheA can occur at different rates depending on whether the MCP is bound, and so we will need two different reactions to model these different rates. In our model, the phosphorylation of the MCP will occur at one fifth the rate when it is bound to the attractant ligand.FreeTP: T(l,Phos~U) -&gt; T(l,Phos~P) k_T_phosBoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*0.2Finally, we represent the phosphorylation and dephosphorylation of CheY. The former requires a phosphorylated MCP receptor, while the latter is done with the help of a CheZ molecule that can be in any state.YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phosYDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephosWe use the snippets above to create a complete set of reaction rules for our simulated system.begin reaction rules	LigandReceptor: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis	#Free vs. 
ligand-bound complexes autophosphorylate	FreeTP: T(l,Phos~U) -&gt; T(l,Phos~P) k_T_phos	BoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*0.2	YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephosend reaction rulesInitializing molecules and parametersTo initialize our simulation, we need to indicate the number of molecules in each state present at the beginning of the simulation. Since we are adding ligands at the beginning of the simulation, the initial number of molecules in each state should be equal to the equilibrium concentrations when no ligand is present.  To this end, we set the amount of phosphorylated receptor equal to one-fourth the concentration of unphosphorylated receptor, and the concentration of phosphorylated CheY to be equal to the concentration of unphosphorylated CheY.Note: This was validated through trial and error.begin species	L(t) L0	T(l,Phos~U) T0*0.8	T(l,Phos~P) T0*0.2	CheY(Phos~U) CheY0*0.5	CheY(Phos~P) CheY0*0.5	CheZ() CheZ0end speciesWe now set initial quantities of molecules along with reaction rate parameters to be consistent with in vivo quantities 123.begin parameters	NaV 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	L0 0        #number of ligand molecules	T0 7000       #number of receptor complexes	CheY0 20000	CheZ0 6000	k_lr_bind 8.8e6/NaV   #ligand-receptor binding	k_lr_dis 35           #ligand-receptor dissociation	k_T_phos 15           #receptor complex autophosphorylation	k_Y_phos 3.8e6/NaV    #receptor complex phosphorylates CheY	k_Y_dephos 8.6e5/NaV  #Z dephosphorylates CheYend parametersNote: The parameters section has to appear before the reaction rules section.Place everything occurring above between begin model and end model tags.Simulating responses to attractantsBefore running the simulation, let’s think about what will happen. If we don’t add any ligand molecules into the system, then assuming that we have started the simulation at steady state, the concentrations of phosphorylated receptors and CheY will remain at equilibrium.We can now run the simulation, setting t_end equal to 3 in order to run the simulation for longer than we did in the ligand-receptor tutorial. Place the following code after end model in your BioNetGen file.generate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;3, n_steps=&gt;100})The following code contains our complete simulation, which you can also download here:phosphorylation.bnglbegin modelbegin molecule types	L(t)             #ligand molecule	T(l,Phos~U~P)    #receptor complex	CheY(Phos~U~P)	CheZ()end molecule typesbegin observables	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules phosphorylated_CheA T(Phos~P)	Molecules bound_ligand L(t!1).T(l!1)end observablesbegin parameters	NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	L0 5e3          #number of ligand molecules	T0 7000       #number of receptor complexes	CheY0 20000	CheZ0 6000	k_lr_bind 8.8e6/NaV2   #ligand-receptor binding	k_lr_dis 35            #ligand-receptor dissociation	k_T_phos 15            #receptor complex autophosphorylation	k_Y_phos 3.8e6/NaV2    #receptor complex phosphorylates Y	k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Yend parametersbegin reaction rules	LR: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis	#Free vs. 
ligand-bound receptor complexes autophosphorylate at different rates	FreeTP: T(l,Phos~U) -&gt; T(l,Phos~P) k_T_phos	BoundTP: L(t!1).T(l!1,Phos~U) -&gt; L(t!1).T(l!1,Phos~P) k_T_phos*0.2	YP: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDep: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephosend reaction rulesbegin species	L(t) L0	T(l,Phos~U) T0*0.8	T(l,Phos~P) T0*0.2	CheY(Phos~U) CheY0*0.5	CheY(Phos~P) CheY0*0.5	CheZ() CheZ0end speciesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;3, n_steps=&gt;100})Now save your file and run the simulation by clicking the Run BNG button. The results will be saved in a new folder called phosphorylation/TIMESTAMP contained in the current directory. Rename the folder within phosphorylation from the time stamp to L0.Open the newly created phosphorylation.gdat file within this folder, and plot the data by clicking on Built-in plotting. What do you observe?When we add ligand molecules into the system, as we did in the tutorial for ligand-receptor dynamics, the concentration of bound receptors should increase. What will happen to the concentrations of phosphorylated CheA and phosphorylated CheY? What will happen to steady state concentrations?Now run your simulation with L0 equal to 5000, and then run it again with L0 equal to 1e5. Do the results confirm your hypothesis? What happens as we keep changing L0? What happens as L0 gets really large (e.g., 1e9)? What do you think is going on?In the main text, we will explore the results of the above simulation. We will then interpret how differences in the amounts of initial ligand can influence changes in the concentration of phosphorylated CheY (and therefore the bacterium’s tumbling frequency).Return to main text            Li M, Hazelbauer GL. 2004. Cellular stoichiometry of the components of the chemotaxis signaling complex. Journal of Bacteriology. Available online &#8617;              Spiro PA, Parkinson JS, and Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Biochemistry 94:7263-7268. Available online. &#8617;              Stock J, Lukat GS. 1991. Intracellular signal transduction networks. Annual Review of Biophysics and Biophysical Chemistry. Available online &#8617;      "
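One detail worth unpacking is the NaV (and NaV2) parameter used in the parameter blocks above. Dividing a bimolecular rate constant given in 1/(M·s) by Avogadro's number times the cell volume converts it into a per-molecule rate; the ~1 femtoliter E. coli volume below is the usual order-of-magnitude assumption rather than a value stated in the tutorial.

```python
# Why dividing by NaV works: molar concentration * (Avogadro * volume) = molecule count,
# so a rate in 1/(M*s) divided by NaV becomes a rate per molecule per second.
AVOGADRO = 6.02e23          # molecules per mole
VOLUME_L = 1e-15            # ~1 fL cell volume (order-of-magnitude assumption)
NaV = AVOGADRO * VOLUME_L   # = 6.02e8, matching the BioNetGen file

k_lr_bind_molar = 8.8e6     # ligand-receptor binding, 1/(M*s)
k_lr_bind = k_lr_bind_molar / NaV
print(f"NaV = {NaV:.3g}; per-molecule binding rate = {k_lr_bind:.3g} per second")
```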
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Modeling a Pure Random Walk Strategy",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_purerandom",
        "date"     : "",
        "content"  : "In this tutorial, we will simulate a random walk and take a look at how well this allows a bacterium to reach a goal. You might not anticipate that the random walk will do a very good job of this  and you would not be wrong  but it will give us a baseline simple strategy to compare against a more advanced random walk strategy.Specifically, we will build a Jupyter notebook to do so. You can create a blank file called chemotaxis_std_random.ipynb and type along, but the notebook will be quite lengthy, so feel free to download the final notebook here if you like: chemotaxis_std_random.ipynb. A detailed explanation of the model and each function can be found in this completed file as well as the tutorial below.Make sure that the following dependencies are installed:            Installation Link      Version      Check install/version                  Python3      3.6+      python --version              Jupyter Notebook      4.4.0+      jupyter --version              Numpy      1.14.5+      pip list | grep numpy              Matplotlib      3.0+      pip list | grep matplotlib              Colorspace or with pip      any      pip list | grep colorspace      Converting a run-and-tumble model to a random walk simulationOur model will be based on observations from our BioNetGen simulation and known biology of E. coli. We summarize this simulation, discussed in the main text, as follows.  Run. The duration of a cell’s run follows an exponential distribution with mean equal to the background run duration run_time_expected.  Tumble. The duration of a cell’s tumble follows an exponential distribution with mean 0.1s1. When it tumbles, we assume it only changes its orientation for the next run but doesn’t move in space. The degree of reorientation is a random number sampled uniformly between  and 360°.  Gradient. We model an exponential gradient with a goal (1500, 1500) having a concentration of 108. All cells start at the origin (0, 0), which has a concentration of 102. The ligand concentration at a point (x, y) is given by L(x, y) = 100 · 108 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal; in this case, D is 1500√2  2121 µm.First, we will import all packages needed.import numpy as npimport matplotlib.pyplot as pltimport mathfrom matplotlib import colorsfrom matplotlib import patchesimport colorspaceNext, we specify all the model parameters:  mean tumble time: 0.1s;  cell speed of 20µm/s2.We also set a “seed” of our pseudorandom number generator to ensure that the sequence of “random” numbers given to us by Python will be the same every time we run the simulation. 
To obtain a different outcome, change the seed.Note: For more on seeding, please consult the discussion of pseudorandom number generation at Programming for Lovers.SEED = 128  #Any random seednp.random.seed(SEED) #set seed for Numpy random number generator#Constants for E.coli tumblingtumble_time_mu = 0.1 #second#E.coli movement constantsspeed = 20         #um/s, speed of E.coli movement#Model constantsstart = [0, 0]  #All cells start at [0, 0]ligand_center = [1500, 1500] #Position of highest concentrationcenter_exponent, start_exponent = 8, 2 #exponent for concentration at [1500, 1500] and [0, 0]origin_to_center = 0 #Distance from start to center, initialized here, will be calculated latersaturation_conc = 10 ** 8 #From BNG modelWe now will have two functions that will establish the ligand concentration at a given point (x, y) as equal to L(x, y) = 100 · 10^(6 · (1 - d/D)).First, we introduce a function to compute the distance between two points in two-dimensional space.# Calculates distance between point a and b# Input: positions a, b. Each in the form array [x, y]# Returns the distance, a float.def distance(a, b):    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)Next, we define a function to determine the concentration of ligand at a given position according to our formula, which will use distance as a subroutine.# Calculates the concentration of a given position# Exponential gradient, the exponent follows a linear relationship with distance to center# Input: position pos, [x, y]# Returns the concentration, a float.def calc_concentration(pos):    dist = distance(pos, ligand_center)    exponent = (1 - dist / origin_to_center) * (center_exponent - start_exponent) + start_exponent    return 10 ** exponentThe following tumble_move function chooses a direction of movement as a uniform random number between 0 and 2π radians. As noted previously, the duration of a cell’s tumble follows an exponential distribution with mean equal to 0.1s.# Samples the new direction and time of a tumble# Calculates projection on the Horizontal and Vertical direction for the next move# No input# Returns the new direction (float), the horizontal movement projection (float), the vertical one (float), and the tumble time (float)def tumble_move():    #Sample the new direction uniformly from 0 to 2pi, record as a float    new_dir = np.random.uniform(low = 0.0, high = 2 * math.pi)    projection_h = math.cos(new_dir) #displacement projected on Horizontal direction for next run, float    projection_v = math.sin(new_dir) #displacement projected on Vertical direction for next run, float    #Length of the tumbling sampled from exponential distribution with mean=0.1, float    tumble_time = np.random.exponential(tumble_time_mu)    return new_dir, projection_h, projection_v, tumble_timeIn a given run of the simulation, we keep track of the total time t, and we only continue our simulation if t &lt; duration, where duration is a parameter indicating how long to run the simulation. If t &lt; duration, then we apply the following steps to a given cell.  
Sample the run duration curr_run_time from an exponential distribution with mean run_time_expected;  run for curr_run_time seconds in the current direction;  sample the duration of tumble tumble_time;  determine the new direction of the simulated bacterium by calling the tumble_move function discussed above;  increment t by curr_run_time and tumble_time.These steps are achieved by the simulate_std_random function below, which takes the number of cells num_cells to simulate, the time to run each simulation for duration, and the mean time of a single run run_time_expected. This function stores the trajectories of these cells in a variable named path.# This function performs simulation# Input: number of cells to simulate (int), how many seconds (int), the expected run time before tumble (float)# Return: the simulated trajectories path: array of shape (num_cells, duration+1, 2)def simulate_std_random(num_cells, duration, run_time_expected):    #Takes the shape (num_cells, duration+1, 2)    #any point [x,y] on the simulated trajectories can be accessed via path[cell, time]    path = np.zeros((num_cells, duration + 1, 2))    for rep in range(num_cells):        # Initialize simulation        t = 0 #record the time elapse        curr_position = np.array(start) #start at [0, 0]        curr_direction, projection_h, projection_v, tumble_time = tumble_move() #Initialize direction randomly        past_sec = 0        while t &lt; duration:            #run            curr_run_time = np.random.exponential(run_time_expected) #get run duration, float            #displacement on either direction is calculated as the projection * speed * time            #update current position by summing old position and displacement            curr_position = curr_position + np.array([projection_h, projection_v]) * speed * curr_run_time            #tumble            curr_direction, projection_h, projection_v, tumble_time = tumble_move()            #increment time            t += (curr_run_time + tumble_time)            #record position approximate for integer number of second            curr_sec = int(t)            for sec in range(past_sec, min(curr_sec, duration) + 1):                #fill values from last time point to current time point                path[rep, sec] = curr_position.copy()                past_sec = curr_sec    return pathNow that we have established parameters and written the functions that we will need, we will run our simulation with num_cells equal to 3 and duration equal to 800 to get a rough idea of what the trajectories of our simulated cells will look like.#Run simulation for 3 cells and plot their pathsduration = 800  #seconds, duration of the simulation, intnum_cells = 3   #number of cells, intorigin_to_center = distance(start, ligand_center) #Update the global constantrun_time_expected = 1.0 #expected run time before tumble, float#Calls the simulate functionpath = simulate_std_random(num_cells, duration, run_time_expected) #get the simulated trajectoriesprint(path[:,-1,:]) #print the terminal position of each simulationVisualizing simulated cell trajectoriesNow that we have generated the data of our randomly walking cells, our next step is to plot these trajectories using Matplotlib. We will color-code the background ligand concentration. The ligand concentrations at each position (a, b) where a and b are both integers can be represented using a matrix, and we take the logarithm of each value of this matrix to better color our exponential gradient. 
That is, a value of 10^8 will be converted to 8, and a value of 10^4 will be converted to 4. A white background color will indicate a low ligand concentration, while red indicates high concentration.#Below are all for plotting purposes#Initialize the plot with 1*1 subplot of size 8*8fig, ax = plt.subplots(1, 1, figsize = (8, 8))#First set color map to color-code the concentrationmycolor = [[256, 256, 256], [256, 255, 254], [256, 253, 250], [256, 250, 240], [255, 236, 209], [255, 218, 185], [251, 196, 171], [248, 173, 157], [244, 151, 142], [240, 128, 128]] #RGB values, from coolors:)for i in mycolor:    for j in range(len(i)):        i[j] *= (1/256) #normalize to 0~1 rangecmap_color = colors.LinearSegmentedColormap.from_list('my_list', mycolor) #Linearly segment these colors to create a continuous color map#Store the concentrations for each integer position in a matrixconc_matrix = np.zeros((4000, 4000)) #we will display from [-1000, -1000] to [3000, 3000]for i in range(4000):    for j in range(4000):        conc_matrix[i][j] = math.log10(calc_concentration([i - 1000, j - 1000])) #calculate the exponents of concentrations at each location#Simulate the gradient distribution, plot as a heatmapax.imshow(conc_matrix.T, cmap=cmap_color, interpolation='nearest', extent = [-1000, 3000, -1000, 3000], origin = 'lower')Next, we plot each cell’s trajectory over each of its tumbling points. To visualize older vs. newer time points, we set the color as a function of t so that newer points have lighter colors.#Plot simulation resultstime_frac = 1.0 / duration#Plot the trajectories. Time progress: dark -&gt; colorfulfor t in range(duration):    ax.plot(path[0,t,0], path[0,t,1], 'o', markersize = 1, color = (0.2 * time_frac * t, 0.85 * time_frac * t, 0.8 * time_frac * t))    ax.plot(path[1,t,0], path[1,t,1], 'o', markersize = 1, color = (0.85 * time_frac * t, 0.2 * time_frac * t, 0.9 * time_frac * t))    ax.plot(path[2,t,0], path[2,t,1], 'o', markersize = 1, color = (0.4 * time_frac * t, 0.85 * time_frac * t, 0.1 * time_frac * t))ax.plot(start[0], start[1], 'ko', markersize = 8) #Mark the starting point [0, 0]for i in range(num_cells):    ax.plot(path[i,-1,0], path[i,-1,1], 'ro', markersize = 8) #Mark the terminal points for each cellWe mark the starting point of each cell’s trajectory with a black dot and the ending point of the trajectory with a red dot. We place a blue cross over the goal. Finally, we set axis limits, assign axis labels, and generate the plot.ax.plot(1500, 1500, 'bX', markersize = 8) #Mark the highest concentration point [1500, 1500]ax.set_title("Pure random walk \n Background: avg tumble every {} s".format(run_time_expected), x = 0.5, y = 0.87)ax.set_xlim(-1000, 3000)ax.set_ylim(-1000, 3000)ax.set_xlabel("position in um")ax.set_ylabel("position in um")plt.show()STOP: Run the notebook. What do you observe? Are the cells moving up the gradient? Is this a good strategy for a bacterium to use to search for food?Quantifying the performance of our search algorithmWe already know from our work in previous modules that a random walk simulation can produce very different outcomes. To assess the performance of the random walk algorithm, we will simulate num_cells = 500 cells and duration = 1500 seconds.Visualizing the trajectories for this many cells will be messy. 
Instead, we will measure the distance between each cell and the target at each time point, and then take the average and standard deviation of this value over all cells.#Run simulation for 500 cells, plot average distance to highest concentration pointduration = 1500   #seconds, duration of the simulationnum_cells = 500 #number of cells, intorigin_to_center = distance(start, ligand_center) #Update the global constantrun_time_expected = 1.0 #expected run time before tumble, floatall_distance = np.zeros((num_cells, duration)) #Initialize to store results, array with shape (num_cells, duration)paths = simulate_std_random(num_cells, duration, run_time_expected) #run simulationfor cell in range(num_cells):    for time in range(duration):        pos = paths[cell, time] #get the position [x,y] for the cell at a given time        dist = distance(ligand_center, pos) #calculate the Euclidean distance between that position to [1500, 1500]        all_distance[cell, time] = dist #record this distance# For all time, take average and standard deviation over all cells.all_dist_avg = np.mean(all_distance, axis = 0) #Calculate average over cells, array of shape (duration,)all_dist_std = np.std(all_distance, axis = 0) #Calculate the standard deviation, array of shape (duration,)We will then plot the average and standard deviation of the distance to goal using the plot and fill_between functions.#Below are all for plotting purposes#Define the colors to usecolors1 = colorspace.qualitative_hcl(h=[0, 300.], c = 60, l = 70, palette = "dynamic")(1)xs = np.arange(0, duration) #Set the x-axis for plot: time points. Array of integers of shape (duration,)fig, ax = plt.subplots(1, 1, figsize = (10, 8)) #Initialize the plot with 1*1 subplot of size 10*8mu, sig = all_dist_avg, all_dist_std#Plot average distance vs. timeax.plot(xs, mu, lw=2, label="pure random walk, background tumble every {} second".format(run_time_expected), color=colors1[0])#Fill in average +/- one standard deviation vs. timeax.fill_between(xs, mu + sig, mu - sig, color = colors1, alpha=0.15)ax.set_title("Average distance to highest concentration")ax.set_xlabel('time (s)')ax.set_ylabel('distance to center (µm)')ax.hlines(0, 0, duration, colors='gray', linestyles='dashed', label='concentration 10^8')ax.legend(loc='upper right')ax.grid()STOP: Before visualizing the average distances at each time step, what do you expect the average distance to the goal to be?Now, run the notebook. The colored line indicates the average distance of the 500 cells; the shaded area corresponds to one standard deviation from the mean; and the grey dashed line corresponds to a distance of 0, the goal, where the ligand concentration reaches its maximum of 10^8.As mentioned, you may not be surprised that this simple random walk strategy is not very effective at finding the goal. Not to worry: in the main text, we discuss how to adapt this strategy into one that better reflects how E. coli explores its environment based on what we have learned in this module about chemotaxis.Return to main text            Saragosti J., Silberzan P., Buguin A. 2012. Modeling E. coli tumbles by rotational diffusion. Implications for chemotaxis. PLoS One 7(4):e35412. Available online. &#8617;              Baker MD, Wolanin PM, Stock JB. 2005. Signal transduction in bacterial chemotaxis. BioEssays 28:9-22. Available online &#8617;      "
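A quick diagnostic of why the pure random walk performs poorly: its net displacement grows roughly like the square root of time, far too slowly to cover the ~2121 µm to the goal. This snippet assumes the paths array and start position produced by the 500-cell run above.

```python
import numpy as np

disp = np.linalg.norm(paths - np.array(start), axis=2)  # distance from the origin, shape (num_cells, duration+1)
mean_disp = disp.mean(axis=0)                           # average over cells at each second
for t in (100, 400, 900):
    # if motion is diffusive, the ratio below stays roughly constant
    print(f"t = {t:3d}s: mean displacement = {mean_disp[t]:7.1f} um, "
          f"ratio to sqrt(t) = {mean_disp[t] / np.sqrt(t):.1f}")
```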
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Simulating Particle Diffusion",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/tutorial-random-walk",
        "date"     : "",
        "content"  : "Setting up CellBlenderIn the main text, we mentioned that we would need MCell to simulate a reaction-diffusion particle model, along with the CellBlender add-on for Blender that will integrate MCell simulations and help us visualize these simulations.The current version of these tutorials is written for MCell3, CellBlender 3.5.1, and Blender 2.79. You can download and install all three programs by following the installation guide on the “Previous downloads” page at the MCell homepage. Once you have followed the installation instructions, start Blender, and you will be ready to build and visualize a particle diffusion model.Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender.Setting up CellBlender simulationsFrom a new Blender file, initialize CellBlender. Delete the existing default cube by right-clicking on the cube to select the cube (an orange outline should be around the cube when it is selected) and pressing the “x” key to delete. Then, in the tab CellBlender &gt; Model Objects, insert a new plane, following the figure below.In CellBlender &gt; Model Objects, click the + symbol to center the cursor. Next press the square “plane” button to create the object. To have CellBlender recognize this object as a model object, press the + button. The name of this object is Plane by default, although you can change this name and edit the color by selecting the color wheel if you like. A slightly transparent coloring will help with visibility but is not necessary.Resizing the render preview window so that objects are visible in the center of the screen is recommended. See the following figure for instructions. Then save your file as CellBlender_Tutorial_Template.blend.From the View menu, select Top to align the view directly overhead. With the plane object selected, follow the arrow over to the object parameters menu (the orange cube) and scale the plane by setting the first two values to “1.5”. Then, hover the mouse over the object and either use ctrl + “+” 6 times or the scroll wheel on your mouse to zoom in.Navigating the CellBlender windowThis section will provide images and descriptions for the different components of the Blender window. When a new file has been created, the following figure shows the menu options available.A: This is the window for modules like CellBlender. To start CellBlender, you must click the CellBlender tab and then click the Initialize CellBlender button as shown in the image. This will then display the image shown as “D” in the figure below.B: There are many View tabs throughout the Blender window. Any future tutorials referring to the View tab are referencing this tab.C: This window contains options relating to a selected object.D: This is the CellBlender menu, which opens after CellBlender has been initialized, and contains sub-menus which will be noted as follows: CellBlender &gt; Model Objects. We recommend dragging the edge of the window outward to increase visibility (see box “e” on the image above).Implementing particle diffusionIn CellBlender, load the CellBlender_Tutorial_Template.blend file from the previous section and save your file as random_walk.blend. You may also download the completed tutorial file here.Right click the plane object to ensure it is selected. 
Visit the object parameters menu (the orange cube) and move the plane by setting the third location value to 1.0 instead of 0.0.Then select CellBlender &gt; Molecules and create the following molecules:  Click the + button.  Select a color (such as orange).  Name the molecule X.  Select the molecule type as Surface Molecule.  Add a diffusion constant of 1e-6.  Increase the scale factor to 5 (click and type 5 or use the arrows).Now visit CellBlender &gt; Molecule Placement to set the following sites for molecules to be released:  Click the + button.  Select or type in the molecule X.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 1.Finally, we are ready to run our diffusion simulation. Visit CellBlender &gt; Run Simulation and select the following options:  Set the number of iterations to 1000.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.The simulation should run quickly, and we are ready to visualize the outcome of the simulation. To do so, visit CellBlender &gt; Reload Visualization Data. You have the option of watching the animation within the Blender window by clicking the play button at the bottom of the screen, as indicated in the figure below. Then, save your file.You can also save and export the movie of your animation using the following steps:  Click the movie tab.  Scroll down to the file name.  Select a suitable location for your file.  Select your favorite file format (we suggest FFmpeg_video).  Click Render &gt; OpenGL Render Animation.The movie will begin playing, and when the animation is complete, the movie file should be found in the folder location you selected.Now that we have run and visualized our diffusion, we will head back to the main text, where we will continue on with our discussion of how the diffusion of particles can help us find Turing patterns.Return to main text"
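For intuition about what the animation is showing, the following sketch generates a single particle's Brownian trajectory directly: at each time step the particle takes a Gaussian step whose size is set by the diffusion constant. This is a cartoon of diffusion rather than MCell's exact sampling scheme, and it assumes the diffusion constant is expressed in cm²/s.

```python
import numpy as np
import matplotlib.pyplot as plt

D, dt, steps = 1e-6, 1e-6, 1000      # diffusion constant (cm^2/s), time step (s), iterations
sigma = np.sqrt(2 * D * dt)          # per-axis standard deviation of one step
step_xy = np.random.normal(0.0, sigma, size=(steps, 2))
path = np.cumsum(step_xy, axis=0)    # running position of the particle

plt.plot(path[:, 0], path[:, 1], lw=0.5)
plt.xlabel("x (cm)")
plt.ylabel("y (cm)")
plt.title("One particle's random walk")
plt.show()
```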
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Generating Turing Patterns with a Reaction-Diffusion Simulation",
        "category" : "",
        "tags"     : "",
        "url"      : "/prologue/turing-cellblender",
        "date"     : "",
        "content"  : "Note: We are currently in the process of updating this tutorial to the latest version of MCell, CellBlender, and Blender. This tutorial works with MCell3, CellBlender 3.5.1, and Blender 2.79. Please see the previous tutorial for a link to download these versions.In this tutorial, we will build the predator-prey reaction-diffusion model that we introduced in the main text. A warning that these simulations can take a long time to run.Load the CellBlender_Tutorial_Template.blend file that you generated in the Random Walk Tutorial. You may also download the complete file here. Save this file as a new file named turing_pattern.blend. The completed tutorial is also available here.We will first visit CellBlender &gt; Molecules and create the B molecules, as shown in the screenshot below.  Click the + button.  Select a color (such as red).  Name the molecule B.  Under molecule type, select surface molecule.  Add a diffusion constant of 3e-6. The diffusion constant indicates how many units to move the particle every time unit. (We will specify the time unit below at runtime.)  Up the scale factor to 2 (click and type 2 or use the arrows).Then, repeat the above steps to make sure that all of the following molecules are entered. We use a molecule named Hidden to represent a “hidden” molecule that will be used to generate A molecules.            Molecule Name      Molecule Type      Color      Diffusion Constant      Scale Factor                  B      Surface      Red      3e-6      3              A      Surface      Green      6e-6      3              Hidden      Surface      Blue      1e-6      0      Now visit CellBlender &gt; Molecule Placement to set the following sites for releasing molecules of each of the three types. First, we will release the hidden molecules across the region so that any new A particles will be produced uniformly.  Click the + button.  Select or type in the molecule Hidden.  Type in the name of the Object/Region Plane.  Set the Quantity to Release as 1000.Then, repeat the above steps to release an initial quantity of A molecules as well, using the following table.            Molecule Name      Object/Region      Quantity to Release                  Hidden      Plane      1000              A      Plane      6000      We are going to release an initial collection of B particles in a cluster in the center of the plane. To do so, we will need a very specific initial release of these particles, and so we will not be able to use the Molecule Placement tab. For this reason, we need to write a Python script to place these molecules, shown below. You can download this script here. 
(Don’t worry if you are not comfortable with Python.)import cellblender as cbdm = cb.get_data_model()mcell = dm['mcell']rels = mcell['release_sites']rlist = rels['release_site_list']point_list = []for x in range(10):    for y in range(10):        point_list.append([x/100,y/100,0.0])for x in range(10):    for y in range(10):        point_list.append([x/100 - 0.5,y/100 - 0.5,0.0])for x in range(10):    for y in range(10):        point_list.append([x/100 - 0.8,y/100,0.0])for x in range(10):    for y in range(10):        point_list.append([x/100 + 0.8,y/100 - 0.8,0.0])new_rel = {    'data_model_version' : "DM_2015_11_11_1717",    'location_x' : "0",    'location_y' : "0",    'location_z' : "0",    'molecule' : "B",    'name' : "pred_rel",    'object_expr' : "arena",    'orient' : "'",    'pattern' : "",    'points_list' : point_list,    'quantity' : "400",    'quantity_type' : "NUMBER_TO_RELEASE",    'release_probability' : "1",    'shape' : "LIST",    'site_diameter' : "0.01",    'stddev' : "0"}rlist.append ( new_rel )cb.replace_data_model ( dm )Locate the Outliner pane on the top-right of the Blender screen. On the left of the view button in the Outliner pane, there is a code tree icon.Click this icon and choose Text Editor. To create a new file for our code, click the + button. Copy and paste the code above (or from your downloaded file) into the text editor and save it with the name pred_center.py.Next visit CellBlender &gt; Scripting &gt; Data-Model Scripting &gt; Run Script, as shown in the following screenshot. Select Internal from the Data-Model Scripting menu and click the refresh button. Click the filename entry area next to File and enter pred_center.py. Click Run Script to execute.You should see that another placement site called pred_rel has appeared in the Molecule Placement tab.Next go to CellBlender &gt; Reactions to create the reactions that will drive the system.  Click the + button.  Under reactants, type Hidden; (note: the semi-colon is important).  Under products, type Hidden; + A;  Set the forward rate as 1e5.Repeat these steps to ensure that we have all of the following reactions.            Reactants      Products      Forward Rate                  Hidden;      Hidden; + A;      1e5              B;      NULL      1e5              B; + B; + A;      B; + B; + B;      1e1      We are now ready to run our simulation. To do so, visit CellBlender &gt; Run Simulation and select the following options:  Set the number of iterations to 200.  Ensure the time step is set as 1e-6.  Click Export &amp; Run.Once the run is complete, save your file.We can also now visualize our simulation. Click CellBlender &gt; Reload Visualization Data. You have the option of watching the animation within the Blender window by clicking the play button at the bottom of the screen.If you like, you can export this animation by using the following steps:  Click the movie tab.  Scroll down to the file name.  Select a suitable location for the file.  Select the file type you would like (we suggest FFmpeg_video).  Click Render &gt; OpenGL Render Animation.The movie will begin playing, and when the animation is complete, the movie file should be in the folder location you selected.You may be wondering how the parameters in the above simulations were chosen. The fact of the matter is that for many choices of these parameters, we will obtain behavior that does not produce an animation as interesting as what we found in this tutorial. 
To see this for yourself, try making slight changes to the feed and kill rates in the CellBlender reactions (e.g., multiplying one of them by 1.25) and watching the animation. How does a small change in parameters cause the animation to change?As we return to the main text, we will discuss how the patterns that we observe change as we make slight changes to these parameters. What biological conclusion can we draw from this phenomenon?Return to main text"
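If you would like to explore this parameter sensitivity without waiting on long MCell runs, the short Python sketch below simulates the Gray-Scott system, a grid-based analogue of the particle-based feed/kill model built in this tutorial. All of its values (grid size, diffusion rates Du and Dv, and the feed and kill parameters f and k) are illustrative assumptions and do not correspond to the CellBlender rates above.

import numpy as np

def laplacian(Z):
    # Discrete Laplacian with periodic boundary conditions
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

def gray_scott(n=128, steps=5000, Du=0.16, Dv=0.08, f=0.035, k=0.065):
    U = np.ones((n, n))    # "prey" chemical
    V = np.zeros((n, n))   # "predator" chemical
    mid = slice(n // 2 - 5, n // 2 + 5)
    U[mid, mid], V[mid, mid] = 0.5, 0.25   # seed a central cluster, as in the tutorial
    for _ in range(steps):
        uvv = U * V * V
        U += Du * laplacian(U) - uvv + f * (1 - U)
        V += Dv * laplacian(V) + uvv - (f + k) * V
    return V   # plot with, e.g., matplotlib's imshow to see the pattern

Multiplying f or k by 1.25 and re-running gray_scott illustrates the same sensitivity that you will observe in CellBlender.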
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Traveling Down an Attractant Gradient",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_removal",
        "date"     : "",
        "content"  : "In the previous tutorial, we simulated the behavior of a bacterium moving up the concentration gradient. In this tutorial, we will simulate the opposite - when the bacterium is not in luck and moves down a concentration gradient.To get started, open Visual Studio Code, and click File &gt; Open Folder.... Open the EColiSimulations folder from the first tutorial. If you would rather not follow along below, you can download a completed BioNetGen file here: removal.bngl.We also will build a Jupyter notebook in this tutorial for plotting the concentrations of different particles over time. To do so, you should save a copy of your plotter_up.ipynb file called plotter_down.ipynb; if you would rather not follow along, we provide a completed notebook here: plotter_down.ipynbBefore running this notebook, make sure the following dependencies are installed.            Installation Link      Version      Check install/version                  Python3      3.6+      python --version              Jupyter Notebook      4.4.0+      jupyter --version              Numpy      1.14.5+      pip list \| grep numpy              Matplotlib      3.0+      pip list \| grep matplotlib              Colorspace or with pip      any      pip list \| grep colorspace      Modeling a decreasing ligand gradient with a BioNetGen functionWe have simulated how the concentration of phosphorylated CheY changes when the cell moves up the attractant gradient. The concentration dips, but over time, methylation states change so that they can compensate for the increased ligand-receptor binding and restore the equilibrium of phosphorylated CheY. What if instead ligands are removed, as we would see if the bacterium is traveling down an attractant gradient? We might imagine that we would see an increase in phosphorylated CheY to increase tumbling and change course, followed by a return to steady state. But is this what we will see?To simulate the cell traveling down an attractant gradient, we will add a kill reaction removing unbound ligand at a constant rate. To do so, we will add the following rule within the reaction rules section.#Simulate ligand removalLigandGone: L(t) -&gt; 0 k_goneIn the parameters section, we start by defining k_gone to be 0.3, so that d[L]/dt = -0.3[L]. The solution of this differential equation is [L] = 107e-0.3t. We will also change the initial ligand concentration (L0) to be 1e7. Thus, the concentration of ligand becomes so low that ligand-receptor binding reaches 0 within 50 seconds.k_gone 0.3L0 1e7We will set the initial concentrations of all species to be the final steady state concentrations from the result for our adaptation.bngl model, and see if after reducing the concentration of unbound ligand gradually, the simulation can restore these concentrations to steady state.First, visit the adaptation.bngl model and add the concentration for each combination of methylation state and ligand binding state of the receptor complex to the observables section. Then run this simulation with L0 equal to 1e7.When the simulation is finished, visit the adaptation folder from the adaptation tutorial and find the simulation result at the final time point.When the model finishes running, input the final concentrations of molecules to the species section of our removal.bngl model. 
Here is what we have.begin species	@EC:L(t) L0	@PM:T(l!1,r,Meth~A,Phos~U).L(t!1) 1190	@PM:T(l!1,r,Meth~B,Phos~U).L(t!1) 2304	@PM:T(l!1,r,Meth~C,Phos~U).L(t!1) 2946	@PM:T(l!1,r,Meth~A,Phos~P).L(t!1) 2	@PM:T(l!1,r,Meth~B,Phos~P).L(t!1) 156	@PM:T(l!1,r,Meth~C,Phos~P).L(t!1) 402	@CP:CheY(Phos~U) CheY0*0.71	@CP:CheY(Phos~P) CheY0*0.29	@CP:CheZ() CheZ0	@CP:CheB(Phos~U) CheB0*0.62	@CP:CheB(Phos~P) CheB0*0.38	@CP:CheR(t) CheR0end speciesRunning the BioNetGen modelWe are now ready to run our BioNetGen model. To do so, first add the following after end model to run our simulation over 1800 seconds.generate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;1800, n_steps=&gt;1800})The following code contains our complete simulation, which can also be downloaded here: removal.bngl.begin modelbegin molecule types	L(t)	T(l,r,Meth~A~B~C,Phos~U~P)	CheY(Phos~U~P)	CheZ()	CheB(Phos~U~P)	CheR(t)end molecule typesbegin observables	Molecules bound_ligand L(t!1).T(l!1)	Molecules phosphorylated_CheY CheY(Phos~P)	Molecules low_methyl_receptor T(Meth~A)	Molecules medium_methyl_receptor T(Meth~B)	Molecules high_methyl_receptor T(Meth~C)	Molecules phosphorylated_CheB CheB(Phos~P)end observablesbegin parameters	NaV2 6.02e8   #Unit conversion to cellular concentration M/L -&gt; #/um^3	miu 1e-6	L0 1e7	T0 7000	CheY0 20000	CheZ0 6000	CheR0 120	CheB0 250	k_lr_bind 8.8e6/NaV2  #ligand-receptor binding	k_lr_dis 35            #ligand-receptor dissociation	k_TaUnbound_phos 7.5   #receptor complex autophosphorylation	k_Y_phos 3.8e6/NaV2   #receptor complex phosphorylates Y	k_Y_dephos 8.6e5/NaV2  #Z dephosphorylates Y	k_TR_bind  2e7/NaV2 #Receptor-CheR binding	k_TR_dis   1          #Receptor-CheR dissociation	k_TaR_meth 0.08       #CheR methylates receptor complex	k_B_phos 1e5/NaV2      #CheB phosphorylation by receptor complex	k_B_dephos 0.17       #CheB autodephosphorylation	k_Tb_demeth 5e4/NaV2   #CheB demethylates receptor complex	k_Tc_demeth 2e4/NaV2 #CheB demethylates receptor complex	k_gone 0.3end parametersbegin reaction rules	LigandReceptor: L(t) + T(l) &lt;-&gt; L(t!1).T(l!1) k_lr_bind, k_lr_dis	#Receptor complex (specifically CheA) autophosphorylation	#Rate dependent on methylation and binding states	#Also on free vs. 
bound with ligand	TaUnboundP: T(l,Meth~A,Phos~U) -&gt; T(l,Meth~A,Phos~P) k_TaUnbound_phos	TbUnboundP: T(l,Meth~B,Phos~U) -&gt; T(l,Meth~B,Phos~P) k_TaUnbound_phos*1.1	TcUnboundP: T(l,Meth~C,Phos~U) -&gt; T(l,Meth~C,Phos~P) k_TaUnbound_phos*2.8	TaLigandP: L(t!1).T(l!1,Meth~A,Phos~U) -&gt; L(t!1).T(l!1,Meth~A,Phos~P) 0	TbLigandP: L(t!1).T(l!1,Meth~B,Phos~U) -&gt; L(t!1).T(l!1,Meth~B,Phos~P) k_TaUnbound_phos*0.8	TcLigandP: L(t!1).T(l!1,Meth~C,Phos~U) -&gt; L(t!1).T(l!1,Meth~C,Phos~P) k_TaUnbound_phos*1.6	#CheY phosphorylation by T and dephosphorylation by CheZ	YPhos: T(Phos~P) + CheY(Phos~U) -&gt; T(Phos~U) + CheY(Phos~P) k_Y_phos	YDephos: CheZ() + CheY(Phos~P) -&gt; CheZ() + CheY(Phos~U) k_Y_dephos	#CheR binds to and methylates receptor complex	#Rate dependent on methylation states and ligand binding	TRBind: T(r) + CheR(t) &lt;-&gt; T(r!2).CheR(t!2) k_TR_bind, k_TR_dis	TaRUnboundMeth: T(r!2,l,Meth~A).CheR(t!2) -&gt; T(r,l,Meth~B) + CheR(t) k_TaR_meth	TbRUnboundMeth: T(r!2,l,Meth~B).CheR(t!2) -&gt; T(r,l,Meth~C) + CheR(t) k_TaR_meth*0.1	TaRLigandMeth: T(r!2,l!1,Meth~A).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~B).L(t!1) + CheR(t) k_TaR_meth*30	TbRLigandMeth: T(r!2,l!1,Meth~B).L(t!1).CheR(t!2) -&gt; T(r,l!1,Meth~C).L(t!1) + CheR(t) k_TaR_meth*3	#CheB is phosphorylated by receptor complex, and autodephosphorylates	CheBphos: T(Phos~P) + CheB(Phos~U) -&gt; T(Phos~U) + CheB(Phos~P) k_B_phos	CheBdephos: CheB(Phos~P) -&gt; CheB(Phos~U) k_B_dephos	#CheB demethylates receptor complex	#Rate dependent on methylation states	TbDemeth: T(Meth~B) + CheB(Phos~P) -&gt; T(Meth~A) + CheB(Phos~P) k_Tb_demeth	TcDemeth: T(Meth~C) + CheB(Phos~P) -&gt; T(Meth~B) + CheB(Phos~P) k_Tc_demeth	#Simulate ligand removal	LigandGone: L(t) -&gt; 0 k_goneend reaction rulesbegin compartments  EC  3  100       #um^3  PM  2  1   EC    #um^2  CP  3  1   PM    #um^3end compartmentsbegin species	@EC:L(t) L0	@PM:T(l!1,r,Meth~A,Phos~U).L(t!1) 1190	@PM:T(l!1,r,Meth~B,Phos~U).L(t!1) 2304	@PM:T(l!1,r,Meth~C,Phos~U).L(t!1) 2946	@PM:T(l!1,r,Meth~A,Phos~P).L(t!1) 2	@PM:T(l!1,r,Meth~B,Phos~P).L(t!1) 156	@PM:T(l!1,r,Meth~C,Phos~P).L(t!1) 402	@CP:CheY(Phos~U) CheY0*0.71	@CP:CheY(Phos~P) CheY0*0.29	@CP:CheZ() CheZ0	@CP:CheB(Phos~U) CheB0*0.62	@CP:CheB(Phos~P) CheB0*0.38	@CP:CheR(t) CheR0end speciesend modelgenerate_network({overwrite=&gt;1})simulate({method=&gt;"ssa", t_end=&gt;1800, n_steps=&gt;1800})Now save your file and run the simulation by clicking Run BNG. The results will be saved in a new folder called removal/TIME contained in the current directory. Rename the folder from the timestamp to the value of k_gone, 0.3.Open the newly created removal.gdat file and create a plot by clicking the Built-in plotting button.What happens to the concentration of phosphorylated CheY? Are the concentrations of complexes at different methylation states restored to their levels before adding ligands to the adaptation.bngl model?As we did in the tutorial simulating increasing ligand, we can try different values for k_gone. Change t_end in the simulate method to 1800 seconds, and run the simulation with k_gone equal to 0.01, 0.03, 0.05, 0.1, and 0.5.Note: All simulation results are stored in the removal directory in your working directory. As you change the values of k_gone, rename the directory with the k_gone values instead of the timestamp for simplicity.Visualizing the results of our simulationWe will use the Jupyter notebook plotter_up.ipynb as a template for the plotter_down.ipynb file that we will use to visualize our results. 
First, we will specify the directories, model name, species of interest, and reaction rates. Put the Jupyter notebook in the same directory as removal.bngl or change the model_path accordingly.model_path = "removal"  #The folder containing the modelmodel_name = "removal"  #Name of the modeltarget = "phosphorylated_CheY"    #Target moleculevals = [0.01, 0.03, 0.05, 0.1, 0.3, 0.5]  #k_gone values of interestThe second code block is the same as that provided in the previous tutorial. This code loads the simulation result at each time point from the .gdat file, which stores the concentration of all observables at all steps. It then plots the concentration of phosphorylated CheY over time.import numpy as npimport sysimport osimport matplotlib.pyplot as pltimport colorspace#Define the colors to usecolors = colorspace.qualitative_hcl(h=[0, 300.], c = 60, l = 70, palette = "dynamic")(len(vals))def load_data(val):    file_path = os.path.join(model_path, str(val), model_name + ".gdat")    with open(file_path) as f:        first_line = f.readline() #Read the first line    cols = first_line.split()[1:] #Get the col names (species names)    ind = 0    while cols[ind] != target:        ind += 1                  #Get col number of target molecule    data = np.loadtxt(file_path)  #Load the file    time = data[:, 0]             #Time points    concentration = data[:, ind]  #Concentrations    return time, concentrationdef plot(val, time, concentration, ax, i):    legend = "k = - " + str(val)    ax.plot(time, concentration, label = legend, color = colors[i])    ax.legend()    returnfig, ax = plt.subplots(1, 1, figsize = (10, 8))for i in range(len(vals)):    val = vals[i]    time, concentration = load_data(val)    plot(val, time, concentration, ax, i)plt.xlabel("time")plt.ylabel("concentration (#molecules)")plt.title("Active CheY vs time")ax.minorticks_on()ax.grid(b = True, which = 'minor', axis = 'both', color = 'lightgrey', linewidth = 0.5, linestyle = ':')ax.grid(b = True, which = 'major', axis = 'both', color = 'grey', linewidth = 0.8 , linestyle = ':')plt.show()Run the notebook. How does the value of k_gone impact the concentration of phosphorylated CheY? Why? Are the tumbling frequencies restored to the background frequency? As we return to the main text, we will show the resulting plots and discuss these questions.Return to main text"
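As a quick sanity check on the decay rate chosen above, we can evaluate the closed-form solution [L] = 10^7 · e^(-0.3t) directly. The standalone snippet below is not part of the BioNetGen model; it simply confirms that almost no free ligand remains after 50 seconds.

import numpy as np

L0, k_gone = 1e7, 0.3
t = np.arange(0, 51)              # seconds
L = L0 * np.exp(-k_gone * t)      # closed-form solution of d[L]/dt = -k_gone[L]
print(L[50])                      # ~3 molecules left at t = 50 s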
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Using RMSD to Compare the Predicted SARS-CoV-2 Spike Protein Against its Experimentally Validated Structure",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_rmsd",
        "date"     : "",
        "content"  : "In this tutorial, we will demonstrate how to apply the Kabsch algorithm to compute the RMSD between two protein structures. In particular, we will show how to compare the experimentally validated structure of the SARS-CoV-2 spike protein (PDB entry: 6vxx) against one of our resulting from homology modeling. You should then feel empowered to run this comparison on our other spike protein predictions, as well as compare our ab initio prediction of human hemoglobin subunit alpha against its validated structure (PDB entry: 1si4).Below is a folder containing all the models that we produced so far along with the experimentally validated structures. Please consult the included README.txt to see which PDB structure to use for comparison against each predicted structure.Download modelsGetting startedThis tutorial will be our first encounter with ProDy, our featured software resource in this module. ProDy is an open-source Python package that allows users to perform protein structural dynamics analysis. Its flexibility allows users to select specific parts or atoms of the structure for conducting normal mode analysis and structure comparison. If you’re not interested in following along, you can download a Jupyter notebook along with all files needed to run this tutorial below.Download completed tutorialTo get started, make sure that you have the following software resources installed.Python (2.7, 3.5, or later)ProDyNumPyBiopythonIPythonMatplotlibWe recommend that you create a workspace (directory) for storing created files when using ProDy or storing protein .pdb files. Open your computer’s terminal app and navigate to this directory using the cd (if you are new to using a command line interface, please consult this introduction.). Then, start IPython using the following command.ipython --pylabFirst, import needed packages and turn interactive mode on (you only need to do this once per session).In[#]: from pylab import *In[#]: from prody import *In[#]: ion()Calculating RMSD of two chainsWe will first compute RMSD for a single chain of the spike protein homotrimer. Because we are dealing with the entire spike protein, we will need to “match” chains that have the greatest paired similarity between our prediction and the result.The first built-in ProDy function that we will use, called parsePDB, parses a protein structure in .pdb format. To use our own protein structure, make sure that the .pdb file is in the current directory. Let’s parse in one of our models we obtained from homology modeling of the SARS-CoV-2 Spike protein, SWISS1. You can use your own SARS-CoV-2 Spike protein model that you generated. In this tutorial, our model will be called swiss1.pdb.In[#]: struct1 = parsePDB(‘swiss1.pdb’)Because we want to find out how well swiss1.pdb performed, we will compare it to the determined protein structure of SARS-CoV-2 Spike protein in the Protein Data Bank. Enter the code shown below. Because the .pdb extension is missing, this command will prompt the console to search for 6vxx, the SARS-CoV-2 Spike protein, from the Protein Data Bank and download the .pdb file into the current directory. It will then save the structure as the variable struct2.In[#]: struct2 = parsePDB(‘6vxx’)With the protein structures parsed, we can now match chains. To do so, we use a built-in function matchChains with a sequence identity threshold of 75% and an overlap threshold of 80% is specified (the default is 90% for both parameters). 
The following function call stores the results in a list called matches. matches[i] denotes the i-th match found between two chains that are stored as matches[i][0] and matches[i][1].In[#]: matches = matchChains(struct1, struct2, seqid = 75, overlap = 80)We will now define our own function that will print matched chains.In[#]: def printMatch(match):...: print('Chain 1 : {}'.format(match[0]))...: print('Chain 2 : {}'.format(match[1]))...: print('Length : {}'.format(len(match[0])))...: print('Seq identity: {}'.format(match[2]))...: print('Seq overlap : {}'.format(match[3]))...: print('RMSD : {}\n'.format(calcRMSD(match[0], match[1])))...:Let’s call our new function printMatch on our previous variable matches.In[#]: for match in matches:…: printMatch(match)…:You should see the results printed out as follows.For example, matches[0][0] corresponds to Chain 1 : AtomMap Chain A from swiss1 -&gt; Chain A from 6vxx and matches[5][1] corresponds to Chain 2: AtomMap Chain C from 6vxx -&gt; Chain B from swiss1.Say that we want to calculate the RMSD score between the matched Chain B from swiss1 and Chain B from 6vxx. This will correspond to matches[4][0] and matches[4][1]. After accessing these two structures, we need to apply the Kabsch algorithm to superimpose and rotate the two structures so that they are as similar as possible, which we do with the built-in function calcTransformation.In[#]: first_ca = matches[4][0]In[#]: second_ca = matches[4][1]In[#]: calcTransformation(first_ca, second_ca).apply(first_ca);Now that the best rotation has been found, we can determine the RMSD between the structures using the built-in function calcRMSD.In[#]: calcRMSD(first_ca, second_ca)You should now see something like the following:Merging multiple chains to compute RMSD of an overall structureNow that we can compare the structures of two chains, it is also possible to merge the chains and calculate the RMSD of the overall homotrimer structure. Below, we merge the three matches corresponding to matching the A chains, B chains, and C chains of the two proteins, and we then compute the RMSD of the resulting structures.In[#]: first_ca = matches[0][0] + matches[4][0] + matches[8][0]In[#]: second_ca = matches[0][1] + matches[4][1] + matches[8][1]In[#]: calcTransformation(first_ca, second_ca).apply(first_ca);In[#]: calcRMSD(first_ca, second_ca)Your results should look like the following:We will leave the RMSD computation for the other models we produced as an exercise.STOP: Apply what you have learned in this tutorial to compute the RMSD between the SARS-CoV-2 spike protein and every one of our predicted homology models, as well as between human hemoglobin subunit alpha and its ab initio model. Download the predicted models here; you should consult the included readme for reference.We are now ready to head back to the main text, where we will discuss the RMSD calculations for all models. Were we successful in predicting the structure of the SARS-CoV-2 spike protein?Return to main text"
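For convenience, the IPython steps above can be collected into a single Python script. The sketch below assumes, as in the session above, that the A-A, B-B, and C-C chain matches appear at indices 0, 4, and 8 of matches; verify this against your own printMatch output before merging.

from prody import parsePDB, matchChains, calcTransformation, calcRMSD

struct1 = parsePDB('swiss1.pdb')   # predicted model (must be in the working directory)
struct2 = parsePDB('6vxx')         # fetched from the Protein Data Bank

matches = matchChains(struct1, struct2, seqid=75, overlap=80)

# Merge the A-A, B-B, and C-C matches into one structure per protein
# (the indices 0, 4, and 8 are an assumption; check them in your own session)
first_ca = matches[0][0] + matches[4][0] + matches[8][0]
second_ca = matches[0][1] + matches[4][1] + matches[8][1]

calcTransformation(first_ca, second_ca).apply(first_ca)  # Kabsch superposition
print(calcRMSD(first_ca, second_ca))                     # RMSD of the full homotrimer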
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Generalizing and Visualizing an Image Shape Space After Applying PCA",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/tutorial_shape_space",
        "date"     : "",
        "content"  : "In a previous tutorial, we segmented and binarized a collection of WBC images. If you completed that tutorial, then you should see those images as a collection of .tiff files in your BWImgs_1 folder inside your WBC_PCAPipeline/Data directory.We are now ready to use CellOrganizer to build a shape space of these images and then apply PCA to the resulting shape vectors in order to reduce the dimension of the dataset.Note: Currently, this tutorial only works for Mac and Linux users. We have created an alternative version of this tutorial for Windows users, which uses CellOrganizer for Docker, available here.Note: If you are not interested in following this tutorial, or you hit a snag, we are providing the final shape vectors post-PCA in the following file: WBC_PCA.csv. Once this file is downloaded, you can also skip down to shape space visualization.Installing CellOrganizerFirst, as in the previous tutorial, you will need the latest version of MATLAB. You should then download the latest version of CellOrganizer for MATLAB, which you can find under Downloads at the CellOrganizer homepage. After downloading a .zip file, extract this file into a folder, and then place this folder somewhere on your computer where you will remember it. (Our suggestion is to place the folder in the same applications folder where MATLAB is found.)To install CellOrganizer, open MATLAB, and in the command window navigate to the CellOrganizer folder that you just extracted by using the cd command. For example, if you are using MacOS, and you extracted the CellOrganizer folder as cellorganizer-master and moved it to your Applications folder, then you would type the following command:cd /Applications/cellorganizer-masterOnce you have navigated into this folder, you will see the contents of your CellOrganizer directory appear under the Current Directory window in MATLAB, as shown below.You are now ready to install CellOrganizer by running setup.m. To do so, enter the following command into the MATLAB command window.setup();That’s it! If your installation was successful, then you should see a message in the MATLAB command window similar to the following.Adding appropiate folders to path.Checking if your system and Matlab version is compatible with CellOrganizer.Checking for updates. CellOrganizer version 2.9.2 is the latest stable release.Keep MATLAB open, as we will be using it in the next step.Generating a PCA ModelCellOrganizer has several different models to perform a collection of cell modeling tasks; we will focus on demo2D08, which will generate a PCA model for our white blood cell nucleus images. All of the necessary code for doing so is contained in WBC_PCAModel.m, a MATLAB file contained within the WBC_PCAPipeline/Step3_ModelGeneration directory. We will not walk through all the details of this file, but feel free to open this file with a text editor.Run the following commands in the MATLAB command window to navigate into the WBC_PCAPipeline/Step3_ModelGeneration directory and then run WBC_PCAModel.m.clearclccd ~/Desktop/WBC_PCAPipeline/Step3_ModelGenerationWBC_PCAModelNote: These runs will generate a large amount of console output. 
You may want to go make a cup of coffee.The run will be complete when you see output analogous to the following.CLEAN UP WORKSPACE AND ENVIRONMENTRemoving temporary folderChecking if model file exists on diskElapsed time is 11.682268 seconds.Creating output directory /Users/phillipcompeau/Desktop/WBC_PCAPipeline/Step3_ModelGeneration/reportNumber of objects: 345As a result, the Step3_PCAModel and Step4_Visualization directories have been updated. The principal components along with the assigned label to each cell are captured in the WBC_PCA.csv file within the Step4 directory. Information about the images used and the shape space that CellOrganizer generated can be found in Step3_PCAModel/report/index.html.Note: If you run the WBC_PCAModel.m file more than once, make sure to delete any log and param files that have been created from a previous run. All other files will be overwritten unless preemptively removed from the WBC_PCAModel file’s access. Saving the files can be done by either compressing the files into a zip folder or removing them from the directory.Now that CellOrganizer has vectorized the images and applied PCA to the resulting shape vectors, we would like to explore the resulting vectors for each image. (Recall from the main text that these vectors are the original shape vectors projected onto the “nearest” hyperplane.)First, run the following commands in the MATLAB command window:load('WBC_PCA.mat');scr = model.nuclearShapeModel.score;Then, double-click on the scr variable in the Workspace window.In the matrix on your screen, each row represents the coordinates for the projection of a single image’s shape vector.An important point is that the first d columns in this matrix correspond to the vector’s projection onto the d-dimensional hyperplane minimizing the sum of squared distances from each shape vector to this hyperplane. For the purpose of our shape space visualization, we will only be focusing on the first three columns of this matrix. In this way, even though each shape vector lives in a high-dimensional space, we will obtain a three-dimensional representation of the data that represents the data faithfully.Shape Space VisualizationHaving generated a PCA model from our WBC images, we are now ready to visualize the resulting simplified three-dimensional shape space with each cell labeled according to its type. To do so, we will use Python 3, so make sure you have installed Python 3.In WBC_PCAPipeline/Step4_Visualization of our provided folder, we provide two Python files (WBC_CellFamily.py and WBC_CellType.py) that we will use for plotting to visualize our shape space and label each image. The first file will label each image by cell family (granulocyte, lymphocyte, or monocyte); the second will use five labels, subdividing granulocytes into basophils, eosinophils, and neutrophils.First, we will label each image in the shape space by cell family. Open a new terminal window (the “Terminal” app on MacOS, and the “Command Prompt” app on Windows) and run the following commands to navigate to “Step 4” of the Pipeline and run our cell family plotter.cd ~/Desktop/WBC_PCAPipeline/Step4_Visualizationpython WBC_CellFamily.pyYou can click, drag, and rotate the resulting plot to see the clusters of cell classes by color (a legend can be found in the upper right corner). 
Furthermore, an image file of this visualization is saved within WBC_PCAPipeline/Step4_Visualization as WBC_ShapeSpace_CF.png.Note: You may need to close the window containing the shape space in order to be able to run additional commands in your terminal window.Next, we will classify images by cell type. In a terminal window, run the following commands to label the shape space according to each of the five cell types. An image of this visualization will be saved within WBC_PCAPipeline/Step4_Visualization as WBC_ShapeSpace_CT.png.cd ~/Desktop/WBC_PCAPipeline/Step4_Visualizationpython WBC_CellType.pyAs we return to the main text, we will show the labeled shape space plots and return to the problem of classification.Return to main text"
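If you would like a rough idea of what WBC_CellFamily.py is doing under the hood, the following minimal Python sketch plots the first three principal components from WBC_PCA.csv in 3D. The column layout is an assumption (the first column holding the cell label and the next three columns holding the first three principal components); consult the provided scripts for the exact format.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('WBC_PCA.csv')
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection='3d')
# Assumes column 0 is the cell label and columns 1-3 are the first three PCs
for label, grp in df.groupby(df.columns[0]):
    ax.scatter(grp.iloc[:, 1], grp.iloc[:, 2], grp.iloc[:, 3], s=10, label=label)
ax.set_xlabel('PC 1'); ax.set_ylabel('PC 2'); ax.set_zlabel('PC 3')
ax.legend()
plt.show()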
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Generalizing and Visualizing an Image Shape Space After Applying PCA (in Docker)",
        "category" : "",
        "tags"     : "",
        "url"      : "/white_blood_cells/tutorial_shape_space_docker",
        "date"     : "",
        "content"  : "In a previous tutorial, we segmented and binarized a collection of WBC images. If you completed that tutorial, then you should see those images as a collection of .tiff files in your BWImgs_1 folder inside your WBC_PCAPipeline/Data directory.We are now ready to use CellOrganizer to build a shape space of these images and then apply PCA to the resulting shape vectors in order to reduce the dimension of the dataset.Note: This version of the tutorial is modified to use the Docker implementation of CellOrganizer rather than the (standard) MATLAB implementation. We created this alternative version primarily for Windows users to allow them to run CellOrganizer. Those who are running a Mac/Linux machine may prefer to follow the original tutorial. Note that MATLAB is still required for this version of the tutorial.Necessary SoftwareWe will need to install Docker in order to use this version of CellOrganizer. To do so, follow the instructions here. For Windows users, we also recommend installing a UNIX-like terminal such as Git Bash, which can be downloaded as part of Git for Windows.Note: In order to get Docker to run, it may be necessary for Windows users to set up the Windows Subsystem for Linux. Also, depending on the computer, it may be necessary to modify the computer’s BIOS settings and enable virtualization technology in order to get Docker to run. Consult the help sections on WSL and virtualization for more details.Running CellOrganizer for DockerCellOrganizer for Docker is accessed via a Jupyter notebook server interface. To get started, first ensure that Docker is running by launching the Docker Desktop app. Next, follow the instructions here to start the server.Note: To execute the run.sh script from the instructions above, first navigate to the folder where you saved the file using Git Bash, and execute the command bash ./run.sh. For example, if you saved the file onto your desktop, you would first type in cd ~/Desktop, and then bash ./run.sh to run the bash script.The output from running the commands in the instructions above is shown below. To access the Jupyter notebook server, copy the URL shown at the bottom of the output (highlighted below).Open a web browser, and navigate to the URL you copied above. This will open the Jupyter notebook server in your browser, which contains all of the software needed to run CellOrganizer and create our model.Next, we need to upload our images to the server so that they can be fed as input to the CellOrganizer model. The most straightforward way to do this would be to upload our WBC_PCAPipeline/Data/BWImgs_1 folder onto the server, but unfortunately we can only upload individual files onto the server. Fortunately, there is a simple workaround - Jupyter notebooks allows us to upload zipped folders, so we can instead upload a zipped folder onto the server which contains all of our images.First, compress your BWImgs_1 folder into a .zip file by right-clicking on the folder in  File Explorer and selecting send to &gt; Compressed (zipped) folder. Next, click the upload button near the top-right corner of the Jupyter notebook screen, and double-click on the BWImgs_1.zip file you just created. Then, click the upload button next to the newly added folder.We are now ready to start using CellOrganizer! Create a new IPython notebook on the server named WBC_PCA.ipynb, and enter the following code into a code cell. 
We will not do a line-by-line walkthrough of the code here, but feel free to compare it with the corresponding MATLAB code contained in Step3_ModelGeneration/WBC_PCAModel.m.! unzip BWImgs_1  # unzip folder - the ! specifies a UNIX command (not python)# import CellOrganizer functionsfrom cellorganizer.tools import img2slml, slml2infoimport osimport sys# Specify model options for CellOrganizeroptions = {'verbose': True,           'debug': False,           'display': False,           'model.name': 'WBC_PCA',           'train.flag': 'framework',           'nucleus.class': 'framework',           'nucleus.type': 'pca',           'cell.class': 'framework',           'cell.type': 'pca',           'skip_preprocessing': True,           # Latent Dimension for the Model           'latent_dim': 15,           # No idea what this is for           'masks': [],           'model.resolution': [0.049, 0.049],           'model.filename': 'WBC_PCA.xml',           'model.id': 'WBC_PCA',           # Set nuclei and cell model name           'nucleus.name': 'WBC_NUC',           'cell.model': 'WBC_CELL',           'documentation.description': 'Trained using demo2D08 from CellOrganizer.'}dimensionality = '2D'# Set path to the binarized segmented imagesdirectory = os.path.join('.', 'BWImgs_1')dna = [os.path.join(directory, 'bw*.tiff')]cellm = [os.path.join(directory, 'bw*.tiff')]# Create the shape space modelimg2slml(dimensionality, dna, cellm, [], options)# img2slml results saved in a MATLAB data file if command run successfully.print("Model output saved successfully:", "WBC_PCA.mat" in os.listdir())The results of running the Python code above will be a new file called WBC_PCA.mat stored on the Jupyter notebook server. Download the file onto your own local computer, and store it in the folder WBC_PCAPipeline/Step3_ModelGeneration.Next, start MATLAB, and set the MATLAB path by clicking the button indicated below, and navigating to your WBC_PCAPipeline/Step3_ModelGeneration folder.Once the path is set, navigate to the Home pane at the top of your MATLAB window, and click on the New Script button. This will open up a new script in your editor window.Enter the following lines of MATLAB code into the newly opened file, which extract and save the principal components from your model to a .csv file:load( [pwd filesep 'WBC_PCA.mat'] );scr = array2table(model.nuclearShapeModel.score);lbls = readtable('../Data/WBC_Labels.csv');mtrx = [lbls scr];writetable(mtrx, '../Step4_Visualization/WBC_PCA.csv');Save the file as extract_and_save_pcs.m in your Step3_ModelGeneration folder. Next, in the MATLAB command window, type inextract_and_save_pcsThis will run the script, and the result will be a new file, WBC_PCA.csv, saved to the folder Step4_Visualization. This file contains the shape vector of each image after PCA has been applied.Note: If you use this file as input for the next tutorial, then you will obtain very slightly different results from those in the text. The reasons why these results do not match are not clear but the conclusions will remain the same.That’s it! You can now follow along the remainder of the tutorial, in which we visualize the post-PCA shape space.Return to main tutorial"
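If you would prefer to avoid MATLAB for the extraction step, a Python alternative may work, provided that WBC_PCA.mat was saved in a pre-v7.3 MAT format that SciPy can read (if it was saved as v7.3/HDF5, you would need h5py instead). This is a hypothetical shortcut, not part of the official pipeline:

import numpy as np
from scipy.io import loadmat

# Load the model and pull out the PCA scores (rows = images, columns = PCs);
# this assumes the file is readable by SciPy, which may not hold for v7.3 files
mat = loadmat('WBC_PCA.mat', squeeze_me=True, struct_as_record=False)
scr = np.asarray(mat['model'].nuclearShapeModel.score)
print(scr.shape)
np.savetxt('WBC_PCA_scores.csv', scr, delimiter=',')  # labels must still be joined separately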
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Comparing different chemotaxis default tumbling frequencies",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_tumbling_frequencies",
        "date"     : "",
        "content"  : "In this tutorial, we will run a comparison of the chemotactic random walk over a variety of different background tumbling frequencies. Are some frequencies better than others at helping the bacterium reach the goal?Qualitative comparison of different background tumbling frequenciesFirst, we will use chemotaxis_walk.ipynb from our modified random walk tutorial to compare the trajectories of a few cells for different tumbling frequencies.Specifically, we will run our simulation for three cells over a time period of 800 seconds. We simulate each cell multiple times using a variety of different tumbling frequencies. (We use average tumbling frequencies of 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, and 10.0 seconds.) This will give us a rough idea of what the trajectories look like.duration = 800   #seconds, duration of the simulationnum_cells = 3origin_to_center = euclidean_distance(start, ligand_center) #Update the global constantrun_time_expected_all = [0.5, 1.0, 5.0]paths = np.zeros((len(run_time_expected_all), num_cells, duration + 1, 2))for i in range(len(run_time_expected_all)):    run_time_expected = run_time_expected_all[i]    paths[i] = simulate_chemotaxis(num_cells, duration, run_time_expected)As we did previously, we then plot the trajectories.conc_matrix = np.zeros((3500, 3500))for i in range(3500):    for j in range(3500):        conc_matrix[i][j] = math.log(calc_concentration([i - 500, j - 500]))mycolor = [[256, 256, 256], [256, 255, 254], [256, 253, 250], [256, 250, 240], [255, 236, 209], [255, 218, 185], [251, 196, 171], [248, 173, 157], [244, 151, 142], [240, 128, 128]] #from coolors:)for i in mycolor:    for j in range(len(i)):        i[j] *= (1/256)cmap_color = colors.LinearSegmentedColormap.from_list('my_list', mycolor)for freq_i in range(len(run_time_expected_all)):    fig, ax = plt.subplots(1, figsize = (8, 8))    ax.imshow(conc_matrix.T, cmap=cmap_color, interpolation='nearest', extent = [-500, 3000, -500, 3000], origin = 'lower')    #Plot simulation results    time_frac = 1.0 / duration    #Time progress: dark -&gt; colorful    for t in range(duration):        ax.plot(paths[freq_i,0,t,0], paths[freq_i,0,t,1], 'o', markersize = 1, color = (0.2 * time_frac * t, 0.85 * time_frac * t, 0.8 * time_frac * t))        ax.plot(paths[freq_i,1,t,0], paths[freq_i,1,t,1], 'o', markersize = 1, color = (0.85 * time_frac * t, 0.2 * time_frac * t, 0.9 * time_frac * t))        ax.plot(paths[freq_i,2,t,0], paths[freq_i,2,t,1], 'o', markersize = 1, color = (0.4 * time_frac * t, 0.85 * time_frac * t, 0.1 * time_frac * t))    ax.plot(start[0], start[1], 'ko', markersize = 8)    ax.plot(1500, 1500, 'bX', markersize = 8)    for i in range(num_cells):        ax.plot(paths[freq_i,i,-1,0], paths[freq_i,i,-1,1], 'ro', markersize = 8)    ax.set_title("Background tumbling freq:\n tumble every {} s".format(run_time_expected_all[freq_i]), x = 0.5, y = 0.9, fontsize = 12)    ax.set_xlim(-500, 3000)    ax.set_ylim(-500, 3000)    ax.set_xlabel("poisiton in μm")    ax.set_ylabel("poisiton in μm")plt.show()STOP: Run the code blocks for simulating the random walks and plotting the outcome. Are the cells moving up the gradient? How do the shapes of the trajectories differ for different tumbling frequencies? What value of the average tumbling frequency do you think is best?Comparing tumbling frequencies over many cellsWe will now scale up our simulation to num_cells = 500 cells. 
To rigorously compare the results of the simulation for different default tumbling frequencies, we will calculate the average distance to the center at each time step for each tumbling frequency that we use.#Run simulation for 500 cells with different background tumbling frequencies, Plot average distance to highest concentration pointduration = 1500   #seconds, duration of the simulationnum_cells = 500run_time_expected_all = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]origin_to_center = euclidean_distance(start, ligand_center) #Update the global constantall_distance = np.zeros((len(run_time_expected_all), num_cells, duration)) #Initialize to store resultspaths = np.zeros((len(run_time_expected_all), num_cells, duration + 1, 2))for i in range(len(run_time_expected_all)):    run_time_expected = run_time_expected_all[i]    paths[i] = simulate_chemotaxis(num_cells, duration, run_time_expected)for freq_i in range(len(run_time_expected_all)):    for c in range(num_cells):        for t in range(duration):            pos = paths[freq_i, c, t]            dist = euclidean_distance(ligand_center, pos)            all_distance[freq_i, c, t] = distall_dist_avg = np.mean(all_distance, axis = 1)all_dist_std = np.std(all_distance, axis = 1)print(all_dist_avg[0][-10:])We then plot the average distance to the goal over time for each frequency, where each tumbling frequency is assigned a different color.#Below are all for plotting purposes#Define the colors to usecolors1 = colorspace.qualitative_hcl(h=[0, 300.], c = 60, l = 70, palette = "dynamic")(len(run_time_expected_all))xs = np.arange(0, duration)fig, ax = plt.subplots(1, figsize = (10, 8))for freq_i in range(len(run_time_expected_all)):    mu, sig = all_dist_avg[freq_i], all_dist_std[freq_i]    ax.plot(xs, mu, lw=2, label="tumble every {} second".format(run_time_expected_all[freq_i]), color=colors1[freq_i])    ax.fill_between(xs, mu + sig, mu - sig, color = colors1[freq_i], alpha=0.1)ax.set_title("Average distance to highest concentration")ax.set_xlabel('time (s)')ax.set_ylabel('distance to center (µm)')ax.legend(loc='lower left', ncol = 1)ax.grid()STOP: Run the code blocks we have provided, simulating the random walks and plotting the average distance to the goal over time for each tumbling frequency. Is there any difference in the performance of the search algorithm for different tumbling frequencies? For each frequency, how long does it take the cell to “reach” the goal? And can we conclude that one tumbling frequency is better than the others?As we return to the main text, we interpret the results of this final tutorial. It turns out that there are significant differences in our chemotaxis algorithm’s ability to find (and remain at) the goal for differing default tumbling frequencies. It hopefully will not surprise you to learn that the frequency that evolution has bestowed upon E. coli turns out to be optimal.Return to main text"
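To help answer the STOP questions above, we can also estimate, for each tumbling frequency, when the average distance first drops below some radius around the goal. The snippet below assumes the all_dist_avg and run_time_expected_all variables defined earlier in the notebook; the 100 µm radius is an arbitrary illustration, not a value from the text.

import numpy as np

def time_to_goal(all_dist_avg, run_time_expected_all, radius=100):
    # For each frequency, find the first second at which the mean distance < radius
    times = {}
    for freq_i, freq in enumerate(run_time_expected_all):
        below = np.where(all_dist_avg[freq_i] < radius)[0]
        times[freq] = int(below[0]) if below.size else None   # None: never reached
    return times

print(time_to_goal(all_dist_avg, run_time_expected_all))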
     
   } ,
  
   {
     
        "title"    : "Software Tutorial: Visualizing Specific Regions of Interest within the Spike Protein Structure",
        "category" : "",
        "tags"     : "",
        "url"      : "/coronavirus/tutorial_visualization",
        "date"     : "",
        "content"  : "In this tutorial, we will discuss how to visualize a protein structure and highlight specific amino acids of interest in the protein. We will focus on the region that we identified in the previous tutorial starting at around position 475 of the SARS-CoV-2 RBD, where we found that this RBD differs structurally from that of SARS-CoV.We will visualize the site in the SARS-CoV-2 RBD using the SARS-CoV-2 chimeric RBD complexed with the human ACE2 enzyme (PDB entry: 6vw1). Before completing this tutorial, you should have installed VMD and know how to load molecules into the program. If you need a refresher, visit the previous tutorial.First, download the chimeric RBD .pdb file and load it into VMD.To change the visualization of the protein, click Graphics &gt; Representation. Double clicking on a representation will enable/disable it.The file 6vw1.pdb contains two biological assemblies of the complex. The first assembly contains Chain A (taken from ACE2) and chain E (taken from RBD), and the second assembly contains chain B (ACE2) and chain F (RBD). We will focus on the second assembly.We will first add chain B, from the ACE2 enzyme, and color it green.      Selected Atoms allows you to select specific parts of the molecule. The keyword all selects all atoms in the file, and so replace all with chain B. Then, click Apply. (In general, to choose a specific chain, use the expression chain X, where X is the chain of interest. To choose a specific residue (i.e., amino acid), use the keyword resid #. Expressions can be combined using the keywords and and or, and more complicated selections need parentheses.        Coloring Method allows you to change the coloring method of the selected atoms. This includes coloring based on element, amino acid residue, and many more. To choose a specific color, select ColorID. A drop-down list will appear to color selection. Choose “7” to select green.        Drawing Method allows you to change the visualization of the selected atoms. Lines (also known as wireframe) draws a line between atoms to represent bonds. Tube focuses only on the backbone of the molecule.  Licorice will show both the backbone and the side chains of the protein. Cartoon/NewCartoon will show the secondary structure of the molecule (protein). We are interesed mostly in the backbone, and so we will choose Tube.  At this point, your OpenGL Display window should look like the following:We next add chain F, from the SARS-CoV-2 chimeric RBD, and color it purple. Click Create Rep, which will duplicate the previous representation. Then, change Selected Atoms to chain F and ColoringID to “11” to color the chain purple. Make sure your other selections are as follows:You should now see two distinct colored structures!We can also change our visualization to target specific amino acids by creating another representation and specifying the amino acid with the keyword resid followed by the position of this amino acid residue.For example, say that we are interested in residue 486 in the RBD (which is phenylalanine). Click Create Rep. In the new representation, change Selected Atoms to chain F and resid 486 and click Apply. Then change the Coloring Method to ColorID and 4. Finally, change the Drawing Method to Licorice.In the OpenGL Display window, you will now see a new yellow projection coming out of the RBD, as shown in the image below. This is residue 486! You may need to rotate the protein a bit to see it. 
(Instructions on how to rotate a molecule and zoom in and out within VMD were given in our tutorial on finding local protein differences.)Let’s now color a few more residues from our region of interest: residues 475 and 487 of the RBD, and residues 19, 24, 79, 82, and 83 of ACE2. As we return to the main text, we will explain why these residues are implicated in binding affinity.Coloring these residues is analogous to the previous steps of just adding new representations and changing Selected Atoms, Coloring Method, and Drawing Method. Make the following representations; note the colors that we use.Your final visualization should look like the following figure.Congratulations! You have now created a detailed visualization of the RBD-ACE2 complex that focuses on our site of interest. As we return to the main text, we will discuss how the highlighted amino acids help increase the binding affinity of the SARS-CoV-2 spike protein to ACE2.STOP: Create another visualization of the same site using the SARS-CoV RBD complex with ACE2 (PDB entry: 2ajf). How does it compare with your first visualization of SARS-CoV-2 complex? Use the graphical representations shown in the table below.            Protein      Style      Color      Selection                  SARS-CoV RBD      Tube      ColorID 11      chain F              SARS-CoV RBD      Licorice      ColorID 4      chain F and resid 472              ACE2      Licorice      ColorID 6      chain B and (resid 82 or resid 79 or resid 83)              ACE2      Licorice      ColorID 10      chain B and resid 19              ACE2      Licorice      ColorID 3      chain B and resid 24      Return to main text"
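Because VMD's atom-selection grammar is the key to this tutorial, here are the selections used above collected in one place, each of which can be pasted directly into the Selected Atoms field:

chain B                                          (all atoms of ACE2 chain B)
chain F and resid 486                            (residue 486 of the RBD)
chain B and (resid 82 or resid 79 or resid 83)   (several ACE2 residues at once)
chain B and resid 19                             (a single ACE2 residue)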
     
   } ,
  
   {
     
        "title"    : "Chemotactic random walk",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/tutorial_walk",
        "date"     : "",
        "content"  : "In a previous tutorial, we simulated the movement of a cell moving randomly throughout two-dimensional space in a sequence of steps. At each step, the next direction of the cell’s movement is chosen completely randomly. We called this simple algorithm “strategy 1” in the main text.In this tutorial, we will adapt this simulation into one that attempts to more closely mimic the real behavior of E. coli chemotaxis, based on what we have learned in this module. We will then be able to compare the results of these two algorithms.Make sure that the following dependencies are installed:            Installation Link      Version      Check install/version                  Python3      3.6+      python --version              Jupyter Notebook      4.4.0+      jupyter --version              Numpy      1.14.5+      pip list | grep numpy              Matplotlib      3.0+      pip list | grep matplotlib              Colorspace or with pip      any      pip list | grep colorspace      The chemotactic walk reassesses run length based on relative attractant concentrationWe will use the run-and-tumble model introduced in the random walk tutorial as a basis for building a more realistic model of bacterial movement. Recall that this previous simulation involved the following components.  Run. The duration of a cell’s run follows an exponential distribution with mean equal to the background run duration run_time_expected.  Tumble. The duration of a cell’s tumble follows an exponential distribution with mean 0.1s1. When it tumbles, we assume that the cell changes its orientation but does not change its position. The degree of reorientation is a random number sampled uniformly between  and 360°.  Gradient. We model an exponential gradient with a goal (1500, 1500) having a concentration of 108. All cells start at the origin (0, 0), which has a concentration of 102. The ligand concentration at the point (x, y) is given by L(x, y) = 100 · 108 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal; in this case, D is 1500√2  2121 µm.In this tutorial, we will modify this simulation so that the duration of a run is based on the relative change of concentration of attractant at the cell’s current point compared to its previous point.In the main text, we stated that we would model a chemotactic strategy by sampling from an exponential distribution every tresponse seconds (tresponse is called the response time), where the mean of the exponential distribution changes based on the relative change in concentration. Specifically, we let t0 denote the mean background run duration and Δ[L] denote the percentage difference between the ligand concentration L(x, y) at the cell’s current point and the ligand concentration at the cell’s previous point.Then, to determine whether the cell will tumble, we perform the following steps.  We take the maximum of 0.000001 and t0 * (1 + 10 · Δ[L]).  We take the minimum of the resulting value and 4 · t0.  We set the resulting value as the mean of an exponential distribution and sample a run time p from this distribution.  If p is smaller than tresponse, then the cell will tumble after p seconds. 
Otherwise, it continues in its current direction for tresponse seconds, at which time it will repeat steps 1-4.We continue this process of running and tumbling for the total duration of the simulation, where every tresponse seconds, we use the above steps to assess whether or not to tumble in the next time interval.This algorithm is summarized by the following Python code, which calls a function run_duration() to determine the length of a single run. This algorithm uses a value of response_time of 0.5 seconds, since this is the approximate time that we have observed it takes E. coli to change its behavior in response to an attractant. The total time of the simulation is given as a parameter duration in seconds. # This function performs simulation # Input: number of cells to simulate (int), how many seconds (int), the expected run time before tumble (float) # Return: the simulated trajectories paths: array of shape (num_cells, duration+1, 2) def simulate_chemotaxis(num_cells, duration, run_time_expected):     #Takes the shape (num_cells, duration+1, 2)     #any point [x,y] on the simulated trajectories can be accessed via paths[cell, time]     paths = np.zeros((num_cells, duration + 1, 2))     for rep in range(num_cells):         # Initialize simulation         t = 0 #record the time elapse         curr_position = np.array(start) #start at [0, 0]         past_conc = calc_concentration(start) #Initialize concentration         projection_h, projection_v, tumble_time = tumble_move() #Initialize direction randomly         while t &lt; duration:             curr_conc = calc_concentration(curr_position)             curr_run_time = run_duration(curr_conc, past_conc, curr_position, run_time_expected) #get run duration, float             # if run time (r) is within the step (s), run for r seconds and then tumble             if curr_run_time &lt; response_time:                 #displacement on either direction is calculated as the projection * speed * time                 #update current position by summing old position and displacement                 curr_position = curr_position + np.array([projection_h, projection_v]) * speed * curr_run_time                 projection_h, projection_v, tumble_time = tumble_move() #tumble                 t += (curr_run_time + tumble_time) #increment time             # if r &gt; s, run for s; the tumble decision is deferred to the next iteration             else:                 #displacement on either direction is calculated as the projection * speed * time                 #update current position by summing old position and displacement                 curr_position = curr_position + np.array([projection_h, projection_v]) * speed * response_time                 t += response_time #no tumble here             #record the position at each integer second             curr_sec = int(t)             if curr_sec &lt;= duration:                 #fill values from last time point to current time point                 paths[rep, curr_sec] = curr_position.copy()                 past_conc = curr_conc     return pathsWe now provide code for the function run_duration. This function samples a random number from an exponential distribution whose mean is equal to min(4 · t0, max(0.000001, t0 · (1 + 10 · Δ[L]))). 
Note that before we compute this formula, we ensure that the current concentration is not greater than some maximum concentration saturation_conc at which the concentration of attractant is saturated.# Calculate the wait time for next tumbling event# Input: current concentration (float), past concentration (float), position (array [x, y]), expected run time (float)# Return: duration of current run (float)def run_duration(curr_conc, past_conc, position, run_time_expected):  curr_conc = min(curr_conc, saturation_conc) #Can't detect higher concentration if receptors saturate  past_conc = min(past_conc, saturation_conc)  change = (curr_conc - past_conc) / past_conc #proportion change in concentration, float  run_time_expected_adj_conc = run_time_expected * (1 + 10 * change) #adjust based on concentration change, float  if run_time_expected_adj_conc &lt; 0.000001:      run_time_expected_adj_conc = 0.000001 #positive wait times  elif run_time_expected_adj_conc &gt; 4 * run_time_expected:      run_time_expected_adj_conc = 4 * run_time_expected     #the decrease to tumbling frequency is only to a certain extent  #Sample the duration of current run from exponential distribution, mean=run_time_expected_adj_conc  curr_run_time = np.random.exponential(run_time_expected_adj_conc)  return curr_run_timeComparing the performance of the two strategiesNow that we have modified our random walk simulation to be more biologically accurate, we will compare the performance of these cells against those following the original random walk. How much better do the cells following the biologically accurate strategy fare?To do so, we provide a Jupyter notebook here: chemotaxis_compare.ipynb.Qualitative comparisonWe will first visualize the trajectories of three cells following each of our two strategies. To do so, first initialize the model by running the code for Part 1: Model specification.The following code simulates three cells for 800 seconds for each of the two strategies.#Run simulation for 3 cells for each strategy, plot pathsduration = 800   #seconds, duration of the simulationnum_cells = 3origin_to_center = distance(start, ligand_center) #Update the global constantrun_time_expected = 1.0paths_rand = simulate_std_random(num_cells, duration, run_time_expected)paths_che = simulate_chemotaxis(num_cells, duration, run_time_expected)paths = np.array([paths_rand, paths_che])Now that we have simulated the cells, we will visualize the results of their walks. The plotting is similar to that in the random walk tutorial, except that this time, we will have two subplots, one for the pure random walk strategy, and the other for the chemotactic random walk. 
(These subplots are initialized using plt.subplots(1, 2).)#Below are all for plotting purposesmethods = ["Pure random walk", "Chemotactic random walk"]fig, ax = plt.subplots(1, 2, figsize = (16, 8)) #1*2 subplots, size 16*8#First set color mapmycolor = [[256, 256, 256], [256, 255, 254], [256, 253, 250], [256, 250, 240], [255, 236, 209], [255, 218, 185], [251, 196, 171], [248, 173, 157], [244, 151, 142], [240, 128, 128]] #from coolors:)for i in mycolor:    for j in range(len(i)):        i[j] *= (1/256)cmap_color = colors.LinearSegmentedColormap.from_list('my_list', mycolor) #Linearly segment these colors to create a continuous color map#Store the concentrations for each integer position in a matrixconc_matrix = np.zeros((4000, 4000)) #we will display from [-1000, -1000] to [3000, 3000]for i in range(4000):    for j in range(4000):        conc_matrix[i][j] = math.log(calc_concentration([i - 1000, j - 1000]))#Repeat for the two strategiesfor m in range(2):    #Simulate the gradient distribution, plot as a heatmap    ax[m].imshow(conc_matrix.T, cmap=cmap_color, interpolation='nearest', extent = [-1000, 3000, -1000, 3000], origin = 'lower')    #Plot simulation results    time_frac = 1.0 / duration    #Plot the trajectories. Time progress: dark -&gt; colorful    for t in range(duration):        ax[m].plot(paths[m,0,t,0], paths[m,0,t,1], 'o', markersize = 1, color = (0.2 * time_frac * t, 0.85 * time_frac * t, 0.8 * time_frac * t))        ax[m].plot(paths[m,1,t,0], paths[m,1,t,1], 'o', markersize = 1, color = (0.85 * time_frac * t, 0.2 * time_frac * t, 0.9 * time_frac * t))        ax[m].plot(paths[m,2,t,0], paths[m,2,t,1], 'o', markersize = 1, color = (0.4 * time_frac * t, 0.85 * time_frac * t, 0.1 * time_frac * t))    ax[m].plot(start[0], start[1], 'ko', markersize = 8) #Mark the starting point [0, 0]    for i in range(num_cells):        ax[m].plot(paths[m,i,-1,0], paths[m,i,-1,1], 'ro', markersize = 8) #Mark the terminal points for each cell    ax[m].plot(1500, 1500, 'bX', markersize = 8) #Mark the highest concentration point [1500, 1500]    ax[m].set_title("{}\n Average tumble every 1 s".format(methods[m]), x = 0.5, y = 0.87)    ax[m].set_xlim(-1000, 3000)    ax[m].set_ylim(-1000, 3000)    ax[m].set_xlabel("position in μm")    ax[m].set_ylabel("position in μm")fig.tight_layout()plt.show()You are now ready to run the code for Part 2: Visualizing trajectories. Do you notice a difference in the two strategies in helping the cell travel toward the goal?Quantitative comparisonIf you performed the plotting in the previous section, then you may have formed a hypothesis about the effectiveness of the chemotactic strategy compared to the pure random walk. However, because of the variations due to randomness, we should be careful about using only three cells as our sample size. To more rigorously compare the two strategies, we will simulate 500 cells for 1500 seconds for each strategy and consider how close, on average, the cell is to the goal at the end for each strategy.As in the previous section, we first simulate each of the two strategies for the desired number of cells, and store the results of the walk for each cell. 
Then, for each of the two strategies, we plot the average distance to the goal as a function of time, as we did in the random walk tutorial. Recall that the shaded area corresponds to one standard deviation from the mean.

```python
# Below is all for plotting purposes
# Define the colors to use
colors1 = colorspace.qualitative_hcl(h=[0, 200.], c = 60, l = 70, pallete = "dynamic")(2)
xs = np.arange(0, duration)  # x-axis of the plot: time points, integer array of shape (duration,)
fig, ax = plt.subplots(1, figsize = (10, 8))  # initialize the plot with a single subplot of size 10x8
for m in range(2):
    # Get the result for this strategy
    mu, sig = all_dist_avg[m], all_dist_std[m]
    # Plot average distance vs. time
    ax.plot(xs, mu, lw=2, label="{}".format(methods[m]), color=colors1[m])
    # Fill in average +/- one standard deviation vs. time
    ax.fill_between(xs, mu + sig, mu - sig, color=colors1[m], alpha=0.15)
ax.set_title("Average distance to highest concentration")
ax.set_xlabel('time (s)')
ax.set_ylabel('distance to center (µm)')
ax.hlines(0, 0, duration, colors='gray', linestyles='dashed', label='concentration 10^8')  # dashed line at distance 0, the point of highest concentration
ax.legend(loc='upper right', ncol = 2, fontsize = 15)
ax.grid()
```

You are now ready to run the code in Part 3: Comparing performances. Consider whether you feel confident in your hypothesis about the performance of the two cellular strategies before we discuss our analysis back in the main text.

Return to main text

1. Saragosti J, Silberzan P, Buguin A. 2012. Modeling E. coli tumbles by rotational diffusion. Implications for chemotaxis. PLoS One 7(4):e35412. Available online. ↩"
     
   } ,
  
   {
     
        "title"    : "E. coli Explores its World Via a Random Walk",
        "category" : "",
        "tags"     : "",
        "url"      : "/chemotaxis/walk",
        "date"     : "",
        "content"  : "Bacterial runs and tumblesEvery E. coli cell has between five and twelve flagella distributed on its surface1 that can rotate both clockwise and counter-clockwise. When all flagella are rotating counter-clockwise, they form a bundle and propel the cell forward at about 20 µm per second. This speed may seem insignificant, but it is about ten times the length of the cell per second, which is analogous to a car traveling at 160 kph. When a single flagellum rotates clockwise, the flagella become uncoordinated, and the bacterium stops and rotates.2When we examine the bacterium’s movement under a microscope, we see it alternate between periods of “running” in a straight line and then “tumbling” in place (see figure below). Over time, the bacterium’s run and tumble exploration amounts to a random walk through its environment, like the exploration approach used by the lost immortals in this module’s introduction.The run and tumble mechanism of bacterial movement produces a random walk (bottom left). Image courtesy: Sandy Parkinson.STOP: Say that a bacterium travels 20 µm per second, and every second it chooses a random direction in which to travel.  After an hour, approximately how far do we expect it to be from its starting point? (Hint: recall the Random Walk Theorem from the prologue.)Tumbling frequency is constant across speciesBacteria are amazingly diverse. They have evolved for over three billion years to thrive in practically every environment on the planet, including hazardous human-made environments. They manufacture compounds such as antibiotics that larger organisms like ourselves cannot make. Some eukaryotes are even completely dependent upon bacteria to perform some critical task for them, from digesting their food, to camouflaging them from predators, to helping them develop organs3.And yet despite the diversity of the bacterial kingdom, variations in bacterial tumbling frequencies are relatively small. In the absence of an attractant or repellent, E. coli stops to tumble once every 1 to 1.5 seconds45, which is similar to most other bacteria.678 It is as if some invisible force compels all these bacteria to tumble with the same frequency. Recalling Dobzhansky’s quotation from our work in a previous module that “nothing in biology makes sense except in the light of evolution”, we wonder why evolution might hold tumbling frequency constant across species.This question is a fundamental one, and we will return to it at the close of this module after we have learned more about the biochemical basis of chemotaxis and how a bacterium can adjust its behavior in response to a chemical substance. In the process, we will see that despite bacteria being simple organisms, the mechanism that they use to implement chemotaxis is sophisticated and beautiful.Next lesson            Sim M, Koirala S, Picton D, Strahl H, Hoskisson PA, Rao CV, Gillespie CS, Aldridge PD. 2017. Growth rate control of flagellar assembly in Escherichia coli strain RP437. Scientific Reports 7:41189. Available online &#8617;              Baker MD, Wolanin PM, Stock JB. 2005. Signal transduction in bacterial chemotaxis. BioEssays 28:9-22. Available online &#8617;              Ed Yong. I Contain Multitudes: The Microbes Within Us and a Grander View of Life. &#8617;              Weis RM, Koshland DE. 1990. Chemotaxis in Escherichia coli proceeds efficiently from different initial tumble frequencies. Journal of Bacteriology 172:2. Available online &#8617;              Berg HC. 2000. 
Tumbling frequency is constant across species

Bacteria are amazingly diverse. They have evolved for over three billion years to thrive in practically every environment on the planet, including hazardous human-made environments. They manufacture compounds, such as antibiotics, that larger organisms like ourselves cannot make. Some eukaryotes are even completely dependent upon bacteria to perform some critical task for them, from digesting their food, to camouflaging them from predators, to helping them develop organs.[3]

And yet despite the diversity of the bacterial kingdom, variation in bacterial tumbling frequencies is relatively small. In the absence of an attractant or repellent, E. coli stops to tumble once every 1 to 1.5 seconds,[4][5] which is similar to most other bacteria.[6][7][8] It is as if some invisible force compels all these bacteria to tumble with the same frequency. Recalling Dobzhansky's quotation from our work in a previous module that “nothing in biology makes sense except in the light of evolution”, we wonder why evolution might hold tumbling frequency constant across species.

This question is a fundamental one, and we will return to it at the close of this module after we have learned more about the biochemical basis of chemotaxis and how a bacterium can adjust its behavior in response to a chemical substance. In the process, we will see that despite bacteria being simple organisms, the mechanism that they use to implement chemotaxis is sophisticated and beautiful.

Next lesson

1. Sim M, Koirala S, Picton D, Strahl H, Hoskisson PA, Rao CV, Gillespie CS, Aldridge PD. 2017. Growth rate control of flagellar assembly in Escherichia coli strain RP437. Scientific Reports 7:41189. Available online. ↩
2. Baker MD, Wolanin PM, Stock JB. 2005. Signal transduction in bacterial chemotaxis. BioEssays 28:9-22. Available online. ↩
3. Ed Yong. I Contain Multitudes: The Microbes Within Us and a Grander View of Life. ↩
4. Weis RM, Koshland DE. 1990. Chemotaxis in Escherichia coli proceeds efficiently from different initial tumble frequencies. Journal of Bacteriology 172:2. Available online. ↩
5. Berg HC. 2000. Motile behavior of bacteria. Physics Today 53(1):24. Available online. ↩
6. Achouri S, Wright JA, Evans L, Macleod C, Fraser G, Cicuta P, Bryant CE. 2015. The frequency and duration of Salmonella macrophage adhesion events determines infection efficiency. Philosophical Transactions B 370(1661). Available online. ↩
7. Turner L, Ping L, Neubauer M, Berg HC. 2016. Visualizing flagella while tracking bacteria. Biophysical Journal 111(3):630–639. Available online. ↩
8. Gotz R, Schmitt R. 1987. Rhizobium meliloti swims by unidirectional, intermittent rotation of right-handed flagellar helices. J Bacteriol 169:3146–3150. Available online. ↩"
     
   } ,
  
  
   {
     
        "title"    : "v1.0.3 Short Fuse",
        "category" : "",
        "tags"     : "",
        "url"      : "/node_modules/fuzzysearch/CHANGELOG.html",
        "date"     : "",
        "content"  : "v1.0.3 Short Fuse  Improved circuit-breaker when needle and haystack length are equalv1.0.2 Vodka Tonic  Slightly updated circuit-breaker that tests for equal length first  Doubled method performance (see jsperf tests)v1.0.1 Circuit Breaker  Introduced a circuit-breaker where queries longer than the searched string will return false  Introduced a circuit-breaker where queries identical to the searched string will return true  Introduced a circuit-breaker where text containing the entire query will return truev1.0.0 IPO  Initial Public Release"
     
   } ,
  
   {
     
        "title"    : "fuzzysearch",
        "category" : "",
        "tags"     : "",
        "url"      : "/node_modules/fuzzysearch/",
        "date"     : "",
        "content"  : "fuzzysearch  Tiny and blazing-fast fuzzy search in JavaScriptFuzzy searching allows for flexibly matching a string with partial input, useful for filtering data very quickly based on lightweight user input.DemoTo see fuzzysearch in action, head over to bevacqua.github.io/horsey, which is a demo of an autocomplete component that uses fuzzysearch to filter out results based on user input.InstallFrom npmnpm install --save fuzzysearchfuzzysearch(needle, haystack)Returns true if needle matches haystack using a fuzzy-searching algorithm. Note that this program doesn’t implement levenshtein distance, but rather a simplified version where there’s no approximation. The method will return true only if each character in the needle can be found in the haystack and occurs after the preceding character.fuzzysearch('twl', 'cartwheel') // &lt;- truefuzzysearch('cart', 'cartwheel') // &lt;- truefuzzysearch('cw', 'cartwheel') // &lt;- truefuzzysearch('ee', 'cartwheel') // &lt;- truefuzzysearch('art', 'cartwheel') // &lt;- truefuzzysearch('eeel', 'cartwheel') // &lt;- falsefuzzysearch('dog', 'cartwheel') // &lt;- falseAn exciting application for this kind of algorithm is to filter options from an autocomplete menu, check out horsey for an example on how that might look like.But! RegExps…!LicenseMIT"
An exciting application for this kind of algorithm is filtering options in an autocomplete menu; check out horsey for an example of how that might look.

But! RegExps…!

License

MIT"
   } ,
  
   {
     
        "title"    : "Simple-Jekyll-Search",
        "category" : "",
        "tags"     : "",
        "url"      : "/node_modules/simple-jekyll-search/",
        "date"     : "",
        "content"  : "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)[![Build Status](https://img.shields.io/travis/christian-fei/Simple-Jekyll-Search/master.svg?)](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)[![dependencies Status](https://img.shields.io/david/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search)[![devDependencies Status](https://img.shields.io/david/dev/christian-fei/Simple-Jekyll-Search.svg)](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)A JavaScript library to add search functionality to any Jekyll blog.## Use caseYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?*No server configurations or databases to maintain*.Just **5 minutes** to have a **fully working searchable blog**.---## Installation### npm```shnpm install simple-jekyll-search```## Getting started### Create `search.json`Place the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))This file will be used as a small data source to perform the searches on the client side:```yaml---layout: none---[  {% for post in site.posts %}    {      "title"    : "{{ post.title | escape }}",      "category" : "{{ post.category }}",      "tags"     : "{{ post.tags | join: ', ' }}",      "url"      : "{{ site.baseurl }}{{ post.url }}",      "date"     : "{{ post.date }}"    } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Preparing the plugin### Add DOM elementsSimpleJekyllSearch needs two `DOM` elements to work:- a search input field- a result container to display the results#### Give me the codeHere is the code you can use with the default configuration:You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)For example in  **_layouts/default.html**:```html```## UsageCustomize SimpleJekyllSearch by passing in your configuration options:```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById('search-input'),  resultsContainer: document.getElementById('results-container'),  json: '/search.json'})```### returns { search }A new instance of SimpleJekyllSearch returns an object, with the only property `search`.`search` is a function used to simulate a user input and display the matching results. E.g.:```jsvar sjs = SimpleJekyllSearch({ ...options })sjs.search('Hello')```💡 it can be used to filter posts by tags or categories!## OptionsHere is a list of the available options, usage questions, troubleshooting & guides.### searchInput (Element) [required]The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.### resultsContainer (Element) [required]The container element in which the search results should be rendered in. 
Typically a ``.### json (String|JSON) [required]You can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.### searchResultTemplate (String) [optional]The template of a single rendered search result.The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.E.g.The template```jsvar sjs = SimpleJekyllSearch({  searchInput: document.getElementById('search-input'),  resultsContainer: document.getElementById('results-container'),  json: '/search.json',  searchResultTemplate: '{title}'})```will render to the following```htmlWelcome to Jekyll!```If the `search.json` contains this data```json[    {      "title"    : "Welcome to Jekyll!",      "category" : "",      "tags"     : "",      "url"      : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",      "date"     : "2014-11-01 21:07:22 +0100"    }]```### templateMiddleware (Function) [optional]A function that will be called whenever a match in the template is found.It gets passed the current property name, property value, and the template.If the function returns a non-undefined value, it gets replaced in the template.This can be potentially useful for manipulating URLs etc.Example:```jsSimpleJekyllSearch({  ...  templateMiddleware: function(prop, value, template) {    if (prop === 'bar') {      return value.replace(/^\//, '')    }  }  ...})```See the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example### sortMiddleware (Function) [optional]A function that will be used to sort the filtered results.It can be used for example to group the sections together.Example:```jsSimpleJekyllSearch({  ...  sortMiddleware: function(a, b) {    var astr = String(a.section) + "-" + String(a.caption);    var bstr = String(b.section) + "-" + String(b.caption);    return astr.localeCompare(bstr)  }  ...})```### noResultsText (String) [optional]The HTML that will be shown if the query didn't match anything.### limit (Number) [optional]You can limit the number of posts rendered on the page.### fuzzy (Boolean) [optional]Enable fuzzy search to allow less restrictive matching.### exclude (Array) [optional]Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).### success (Function) [optional]A function called once the data has been loaded.### debounceTime (Number) [optional]Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.---## If search isn't working due to invalid JSON- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. 
To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.For example: in search.json, replace```json"content": "{{ page.content | strip_html | strip_newlines }}"```with```json"content": "{{ page.content | strip_html | strip_newlines | remove_chars | escape }}"```If this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:```js"content": {{ page.content | jsonify }}```**Note: you don't need to use quotes `"` in this since `jsonify` automatically inserts them.**## Enabling full-text searchReplace `search.json` with the following code:```yaml---layout: none---[  {% for post in site.posts %}    {      "title"    : "{{ post.title | escape }}",      "category" : "{{ post.category }}",      "tags"     : "{{ post.tags | join: ', ' }}",      "url"      : "{{ site.baseurl }}{{ post.url }}",      "date"     : "{{ post.date }}",      "content"  : "{{ post.content | strip_html | strip_newlines }}"    } {% unless forloop.last %},{% endunless %}  {% endfor %}  ,  {% for page in site.pages %}   {     {% if page.title != nil %}        "title"    : "{{ page.title | escape }}",        "category" : "{{ page.category }}",        "tags"     : "{{ page.tags | join: ', ' }}",        "url"      : "{{ site.baseurl }}{{ page.url }}",        "date"     : "{{ page.date }}",        "content"  : "{{ page.content | strip_html | strip_newlines }}"     {% endif %}   } {% unless forloop.last %},{% endunless %}  {% endfor %}]```## Development- `npm install`- `npm test`#### Acceptance tests```bashcd example; jekyll serve# in another tabnpm run cypress -- run```## ContributorsThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)> [@daviddarnes](https://github.com/daviddarnes)[@XhmikosR](https://github.com/XhmikosR)[@PeterDaveHello](https://github.com/PeterDaveHello)[@mikeybeck](https://github.com/mikeybeck)[@egladman](https://github.com/egladman)[@midzer](https://github.com/midzer)[@eduardoboucas](https://github.com/eduardoboucas)[@kremalicious](https://github.com/kremalicious)[@tibotiber](https://github.com/tibotiber)and many others!## Stargazers over time[![Stargazers over time](https://starchart.cc/christian-fei/Simple-Jekyll-Search.svg)](https://starchart.cc/christian-fei/Simple-Jekyll-Search)"
     
   }
  
]

Development

npm install
npm test

Acceptance tests

cd example; jekyll serve

# in another tab

npm run cypress -- run

Contributors

Thanks to all contributors over the years! You are the best :)

@daviddarnes @XhmikosR @PeterDaveHello @mikeybeck @egladman @midzer @eduardoboucas @kremalicious @tibotiber and many others!

Stargazers over time
