A JavaScript library to add search functionality to any Jekyll blog.
Do you have a blog built with Jekyll and want lightweight, purely client-side search functionality?
No server configurations or databases to maintain.
Just 5 minutes to have a fully working searchable blog.
Install the library from npm:
npm install simple-jekyll-search
search.json
Place the following code in a file called search.json in the root of your Jekyll blog. (You can also get a copy from here.)
This file will be used as a small data source to perform the searches on the client side:
---
layout: none
---
[
  {% for post in site.posts %}
    {
      "title"    : "{{ post.title | escape }}",
      "category" : "{{ post.category }}",
      "tags"     : "{{ post.tags | join: ', ' }}",
      "url"      : "{{ site.baseurl }}{{ post.url }}",
      "date"     : "{{ post.date }}"
    } {% unless forloop.last %},{% endunless %}
  {% endfor %}
]
SimpleJekyllSearch needs two DOM elements to work: a search input field and a result container to display the results.
Here is the code you can use with the default configuration:
You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)
For example in _layouts/default.html:
<!-- HTML elements for search -->
<input type="text" id="search-input" placeholder="Search blog posts..">
<ul id="results-container"></ul>
<!-- or without installing anything -->
<script src="https://unpkg.com/simple-jekyll-search@latest/dest/simple-jekyll-search.min.js"></script>
Customize SimpleJekyllSearch by passing in your configuration options:
var sjs = SimpleJekyllSearch({
searchInput: document.getElementById('search-input'),
resultsContainer: document.getElementById('results-container'),
json: '/search.json'
})
A new instance of SimpleJekyllSearch returns an object with a single property, search.
search is a function used to simulate a user input and display the matching results.
E.g.:
var sjs = SimpleJekyllSearch({ ...options })
sjs.search('Hello')
💡 it can be used to filter posts by tags or categories!
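For instance, here is a small sketch that filters posts by a tag when a link is clicked; the .tag-link class and data-tag attribute are hypothetical names used only for illustration:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json'
})

// Hypothetical tag links such as: <a href="#" class="tag-link" data-tag="jekyll">jekyll</a>
document.querySelectorAll('.tag-link').forEach(function (link) {
  link.addEventListener('click', function (event) {
    event.preventDefault()
    // simulate typing the tag into the search box and render the matching posts
    sjs.search(link.dataset.tag)
  })
})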
Options
Here is a list of the available options, along with usage questions, troubleshooting, and guides.
searchInput (Element) [required]
The input element on which the plugin listens for keyboard events to trigger the search and the rendering of matching articles.

resultsContainer (Element) [required]
The container element in which the search results are rendered. Typically a <ul>.

json (String|JSON) [required]
You can either pass a URL to the search.json file, or pass the results directly as JSON to save a round trip for the data.
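If the data is already available in the page, a minimal sketch of passing the results directly looks like this; the inline entry mirrors the sample record shown later on this page:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  // passing the parsed data directly skips the request for /search.json
  json: [
    {
      "title" : "Welcome to Jekyll!",
      "category" : "",
      "tags" : "",
      "url" : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",
      "date" : "2014-11-01 21:07:22 +0100"
    }
  ]
})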
searchResultTemplate (String) [optional]
The template of a single rendered search result.
The templating syntax is simple: enclose the properties you want to replace in curly braces.
For example, the template
var sjs = SimpleJekyllSearch({
searchInput: document.getElementById('search-input'),
resultsContainer: document.getElementById('results-container'),
json: '/search.json',
searchResultTemplate: '<li><a href="https://biologicalmodeling.org{url}">{title}</a></li>'
})
will render to the following
<li><a href="/jekyll/update/2014/11/01/welcome-to-jekyll.html">Welcome to Jekyll!</a></li>
if the search.json contains this data:
[
{
"title" : "Welcome to Jekyll!",
"category" : "",
"tags" : "",
"url" : "/jekyll/update/2014/11/01/welcome-to-jekyll.html",
"date" : "2014-11-01 21:07:22 +0100"
}
]
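To make the curly-brace substitution rule concrete, here is a minimal sketch of how such templating can be done (illustrative only; this is not the library's internal code):

// Replace each {prop} placeholder with the matching property from a result object.
function renderTemplate (template, result) {
  return template.replace(/\{(.+?)\}/g, function (match, prop) {
    return prop in result ? result[prop] : match
  })
}

renderTemplate('<li><a href="{url}">{title}</a></li>', {
  title: 'Welcome to Jekyll!',
  url: '/jekyll/update/2014/11/01/welcome-to-jekyll.html'
})
// => '<li><a href="/jekyll/update/2014/11/01/welcome-to-jekyll.html">Welcome to Jekyll!</a></li>'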
templateMiddleware (Function) [optional]
A function that will be called whenever a match in the template is found.
It gets passed the current property name, the property value, and the template.
If the function returns a value other than undefined, that value replaces the property in the rendered template.
This can be useful for manipulating URLs, for example.
Example:
SimpleJekyllSearch({
...
templateMiddleware: function(prop, value, template) {
if (prop === 'bar') {
return value.replace(/^\//, '')
}
}
...
})
See the tests for an in-depth code example
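As a more concrete sketch (illustrative only, not taken from the project docs), here is a middleware that prefixes relative URLs before they are rendered; baseUrl is a hypothetical value you would set for your own site:

var baseUrl = 'https://example.com' // hypothetical site URL
SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  searchResultTemplate: '<li><a href="{url}">{title}</a></li>',
  templateMiddleware: function (prop, value, template) {
    // when the {url} placeholder is filled in, prepend the base URL
    if (prop === 'url') {
      return baseUrl + value
    }
    // returning undefined leaves every other property untouched
  }
})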
sortMiddleware (Function) [optional]
A function that will be used to sort the filtered results.
It can be used, for example, to group sections together.
Example:
SimpleJekyllSearch({
...
sortMiddleware: function(a, b) {
var astr = String(a.section) + "-" + String(a.caption);
var bstr = String(b.section) + "-" + String(b.caption);
return astr.localeCompare(bstr)
}
...
})
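Another illustrative sketch, assuming each result carries the date field shown in the sample search.json above, sorts the matches newest first:

SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  sortMiddleware: function (a, b) {
    // newer dates sort before older ones; assumes `date` is parseable by Date()
    return new Date(b.date) - new Date(a.date)
  }
})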
noResultsText (String) [optional]
The HTML that will be shown if the query didn't match anything.

limit (Number) [optional]
You can limit the number of posts rendered on the page.

fuzzy (Boolean) [optional]
Enable fuzzy search to allow less restrictive matching.

exclude (Array) [optional]
Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs and words are allowed).

success (Function) [optional]
A function called once the data has been loaded.

debounceTime (Number) [optional]
Limit how many times the search function can be executed over a given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no debounceTime (in milliseconds) is provided, a search will be triggered on each keystroke.
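For reference, here is one configuration combining the optional settings above; the option names come from this list, while the values are placeholders to adjust for your own site:

var sjs = SimpleJekyllSearch({
  searchInput: document.getElementById('search-input'),
  resultsContainer: document.getElementById('results-container'),
  json: '/search.json',
  searchResultTemplate: '<li><a href="{url}">{title}</a></li>',
  noResultsText: '<li>No results found</li>',
  limit: 10,            // render at most 10 results
  fuzzy: false,         // keep strict (non-fuzzy) matching
  exclude: ['Welcome'], // results matching these terms are left out
  debounceTime: 300,    // wait 300 ms after the last keystroke before searching
  success: function () {
    console.log('search.json loaded, search is ready')
  }
})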
If search isn't working due to invalid JSON
There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. To use it, add the simple_search_filter.rb file to your _plugins folder and use remove_chars as a filter.
For example, in search.json, replace
"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)[](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)[](https://david-dm.org/christian-fei/Simple-Jekyll-Search)[](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)A JavaScript library to add search functionality to any Jekyll blog.## Use caseYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?*No server configurations or databases to maintain*.Just **5 minutes** to have a **fully working searchable blog**.---## Installation### npm```shnpm install simple-jekyll-search```## Getting started### Create `search.json`Place the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))This file will be used as a small data source to perform the searches on the client side:```yaml---layout: none---[ {% for post in site.posts %} { "title" : "{{ post.title | escape }}", "category" : "{{ post.category }}", "tags" : "{{ post.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ post.url }}", "date" : "{{ post.date }}" } {% unless forloop.last %},{% endunless %} {% endfor %}]```## Preparing the plugin### Add DOM elementsSimpleJekyllSearch needs two `DOM` elements to work:- a search input field- a result container to display the results#### Give me the codeHere is the code you can use with the default configuration:You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)For example in **_layouts/default.html**:```html```## UsageCustomize SimpleJekyllSearch by passing in your configuration options:```jsvar sjs = SimpleJekyllSearch({ searchInput: document.getElementById('search-input'), resultsContainer: document.getElementById('results-container'), json: '/search.json'})```### returns { search }A new instance of SimpleJekyllSearch returns an object, with the only property `search`.`search` is a function used to simulate a user input and display the matching results. E.g.:```jsvar sjs = SimpleJekyllSearch({ ...options })sjs.search('Hello')```💡 it can be used to filter posts by tags or categories!## OptionsHere is a list of the available options, usage questions, troubleshooting & guides.### searchInput (Element) [required]The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.### resultsContainer (Element) [required]The container element in which the search results should be rendered in. 
Typically a ``.### json (String|JSON) [required]You can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.### searchResultTemplate (String) [optional]The template of a single rendered search result.The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.E.g.The template```jsvar sjs = SimpleJekyllSearch({ searchInput: document.getElementById('search-input'), resultsContainer: document.getElementById('results-container'), json: '/search.json', searchResultTemplate: '{title}'})```will render to the following```htmlWelcome to Jekyll!```If the `search.json` contains this data```json[ { "title" : "Welcome to Jekyll!", "category" : "", "tags" : "", "url" : "/jekyll/update/2014/11/01/welcome-to-jekyll.html", "date" : "2014-11-01 21:07:22 +0100" }]```### templateMiddleware (Function) [optional]A function that will be called whenever a match in the template is found.It gets passed the current property name, property value, and the template.If the function returns a non-undefined value, it gets replaced in the template.This can be potentially useful for manipulating URLs etc.Example:```jsSimpleJekyllSearch({ ... templateMiddleware: function(prop, value, template) { if (prop === 'bar') { return value.replace(/^\//, '') } } ...})```See the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example### sortMiddleware (Function) [optional]A function that will be used to sort the filtered results.It can be used for example to group the sections together.Example:```jsSimpleJekyllSearch({ ... sortMiddleware: function(a, b) { var astr = String(a.section) + "-" + String(a.caption); var bstr = String(b.section) + "-" + String(b.caption); return astr.localeCompare(bstr) } ...})```### noResultsText (String) [optional]The HTML that will be shown if the query didn't match anything.### limit (Number) [optional]You can limit the number of posts rendered on the page.### fuzzy (Boolean) [optional]Enable fuzzy search to allow less restrictive matching.### exclude (Array) [optional]Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).### success (Function) [optional]A function called once the data has been loaded.### debounceTime (Number) [optional]Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.---## If search isn't working due to invalid JSON- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. 
To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.For example: in search.json, replace```json"content": "{{ page.content | strip_html | strip_newlines }}"```with```json"content": "{{ page.content | strip_html | strip_newlines | remove_chars | escape }}"```If this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:```js"content": {{ page.content | jsonify }}```**Note: you don't need to use quotes `"` in this since `jsonify` automatically inserts them.**## Enabling full-text searchReplace `search.json` with the following code:```yaml---layout: none---[ {% for post in site.posts %} { "title" : "{{ post.title | escape }}", "category" : "{{ post.category }}", "tags" : "{{ post.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ post.url }}", "date" : "{{ post.date }}", "content" : "{{ post.content | strip_html | strip_newlines }}" } {% unless forloop.last %},{% endunless %} {% endfor %} , {% for page in site.pages %} { {% if page.title != nil %} "title" : "{{ page.title | escape }}", "category" : "{{ page.category }}", "tags" : "{{ page.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ page.url }}", "date" : "{{ page.date }}", "content" : "{{ page.content | strip_html | strip_newlines }}" {% endif %} } {% unless forloop.last %},{% endunless %} {% endfor %}]```## Development- `npm install`- `npm test`#### Acceptance tests```bashcd example; jekyll serve# in another tabnpm run cypress -- run```## ContributorsThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)> [@daviddarnes](https://github.com/daviddarnes)[@XhmikosR](https://github.com/XhmikosR)[@PeterDaveHello](https://github.com/PeterDaveHello)[@mikeybeck](https://github.com/mikeybeck)[@egladman](https://github.com/egladman)[@midzer](https://github.com/midzer)[@eduardoboucas](https://github.com/eduardoboucas)[@kremalicious](https://github.com/kremalicious)[@tibotiber](https://github.com/tibotiber)and many others!## Stargazers over time[](https://starchart.cc/christian-fei/Simple-Jekyll-Search)"
with
"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)[](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)[](https://david-dm.org/christian-fei/Simple-Jekyll-Search)[](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)A JavaScript library to add search functionality to any Jekyll blog.## Use caseYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?*No server configurations or databases to maintain*.Just **5 minutes** to have a **fully working searchable blog**.---## Installation### npm```shnpm install simple-jekyll-search```## Getting started### Create `search.json`Place the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))This file will be used as a small data source to perform the searches on the client side:```yaml---layout: none---[ {% for post in site.posts %} { "title" : "{{ post.title | escape }}", "category" : "{{ post.category }}", "tags" : "{{ post.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ post.url }}", "date" : "{{ post.date }}" } {% unless forloop.last %},{% endunless %} {% endfor %}]```## Preparing the plugin### Add DOM elementsSimpleJekyllSearch needs two `DOM` elements to work:- a search input field- a result container to display the results#### Give me the codeHere is the code you can use with the default configuration:You need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)For example in **_layouts/default.html**:```html```## UsageCustomize SimpleJekyllSearch by passing in your configuration options:```jsvar sjs = SimpleJekyllSearch({ searchInput: document.getElementById('search-input'), resultsContainer: document.getElementById('results-container'), json: '/search.json'})```### returns { search }A new instance of SimpleJekyllSearch returns an object, with the only property `search`.`search` is a function used to simulate a user input and display the matching results. E.g.:```jsvar sjs = SimpleJekyllSearch({ ...options })sjs.search('Hello')```💡 it can be used to filter posts by tags or categories!## OptionsHere is a list of the available options, usage questions, troubleshooting & guides.### searchInput (Element) [required]The input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.### resultsContainer (Element) [required]The container element in which the search results should be rendered in. 
Typically a ``.### json (String|JSON) [required]You can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.### searchResultTemplate (String) [optional]The template of a single rendered search result.The templating syntax is very simple: You just enclose the properties you want to replace with curly braces.E.g.The template```jsvar sjs = SimpleJekyllSearch({ searchInput: document.getElementById('search-input'), resultsContainer: document.getElementById('results-container'), json: '/search.json', searchResultTemplate: '{title}'})```will render to the following```htmlWelcome to Jekyll!```If the `search.json` contains this data```json[ { "title" : "Welcome to Jekyll!", "category" : "", "tags" : "", "url" : "/jekyll/update/2014/11/01/welcome-to-jekyll.html", "date" : "2014-11-01 21:07:22 +0100" }]```### templateMiddleware (Function) [optional]A function that will be called whenever a match in the template is found.It gets passed the current property name, property value, and the template.If the function returns a non-undefined value, it gets replaced in the template.This can be potentially useful for manipulating URLs etc.Example:```jsSimpleJekyllSearch({ ... templateMiddleware: function(prop, value, template) { if (prop === 'bar') { return value.replace(/^\//, '') } } ...})```See the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example### sortMiddleware (Function) [optional]A function that will be used to sort the filtered results.It can be used for example to group the sections together.Example:```jsSimpleJekyllSearch({ ... sortMiddleware: function(a, b) { var astr = String(a.section) + "-" + String(a.caption); var bstr = String(b.section) + "-" + String(b.caption); return astr.localeCompare(bstr) } ...})```### noResultsText (String) [optional]The HTML that will be shown if the query didn't match anything.### limit (Number) [optional]You can limit the number of posts rendered on the page.### fuzzy (Boolean) [optional]Enable fuzzy search to allow less restrictive matching.### exclude (Array) [optional]Pass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).### success (Function) [optional]A function called once the data has been loaded.### debounceTime (Number) [optional]Limit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.---## If search isn't working due to invalid JSON- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. 
To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.For example: in search.json, replace```json"content": "{{ page.content | strip_html | strip_newlines }}"```with```json"content": "{{ page.content | strip_html | strip_newlines | remove_chars | escape }}"```If this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:```js"content": {{ page.content | jsonify }}```**Note: you don't need to use quotes `"` in this since `jsonify` automatically inserts them.**## Enabling full-text searchReplace `search.json` with the following code:```yaml---layout: none---[ {% for post in site.posts %} { "title" : "{{ post.title | escape }}", "category" : "{{ post.category }}", "tags" : "{{ post.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ post.url }}", "date" : "{{ post.date }}", "content" : "{{ post.content | strip_html | strip_newlines }}" } {% unless forloop.last %},{% endunless %} {% endfor %} , {% for page in site.pages %} { {% if page.title != nil %} "title" : "{{ page.title | escape }}", "category" : "{{ page.category }}", "tags" : "{{ page.tags | join: ', ' }}", "url" : "{{ site.baseurl }}{{ page.url }}", "date" : "{{ page.date }}", "content" : "{{ page.content | strip_html | strip_newlines }}" {% endif %} } {% unless forloop.last %},{% endunless %} {% endfor %}]```## Development- `npm install`- `npm test`#### Acceptance tests```bashcd example; jekyll serve# in another tabnpm run cypress -- run```## ContributorsThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)> [@daviddarnes](https://github.com/daviddarnes)[@XhmikosR](https://github.com/XhmikosR)[@PeterDaveHello](https://github.com/PeterDaveHello)[@mikeybeck](https://github.com/mikeybeck)[@egladman](https://github.com/egladman)[@midzer](https://github.com/midzer)[@eduardoboucas](https://github.com/eduardoboucas)[@kremalicious](https://github.com/kremalicious)[@tibotiber](https://github.com/tibotiber)and many others!## Stargazers over time[](https://starchart.cc/christian-fei/Simple-Jekyll-Search)"
If this doesn't work when using GitHub Pages, you can try jsonify to make sure the content is JSON compatible:
"content": "# [Simple-Jekyll-Search](https://www.npmjs.com/package/simple-jekyll-search)\n\n[](https://travis-ci.org/christian-fei/Simple-Jekyll-Search)\n[](https://david-dm.org/christian-fei/Simple-Jekyll-Search)\n[](https://david-dm.org/christian-fei/Simple-Jekyll-Search?type=dev)\n\nA JavaScript library to add search functionality to any Jekyll blog.\n\n## Use case\n\nYou have a blog, built with Jekyll, and want a **lightweight search functionality** on your blog, purely client-side?\n\n*No server configurations or databases to maintain*.\n\nJust **5 minutes** to have a **fully working searchable blog**.\n\n---\n\n## Installation\n\n### npm\n\n```sh\nnpm install simple-jekyll-search\n```\n\n## Getting started\n\n### Create `search.json`\n\nPlace the following code in a file called `search.json` in the **root** of your Jekyll blog. (You can also get a copy [from here](/example/search.json))\n\nThis file will be used as a small data source to perform the searches on the client side:\n\n```yaml\n---\nlayout: none\n---\n[\n {% for post in site.posts %}\n {\n \"title\" : \"{{ post.title | escape }}\",\n \"category\" : \"{{ post.category }}\",\n \"tags\" : \"{{ post.tags | join: ', ' }}\",\n \"url\" : \"{{ site.baseurl }}{{ post.url }}\",\n \"date\" : \"{{ post.date }}\"\n } {% unless forloop.last %},{% endunless %}\n {% endfor %}\n]\n```\n\n\n## Preparing the plugin\n\n### Add DOM elements\n\nSimpleJekyllSearch needs two `DOM` elements to work:\n\n- a search input field\n- a result container to display the results\n\n#### Give me the code\n\nHere is the code you can use with the default configuration:\n\nYou need to place the following code within the layout where you want the search to appear. (See the configuration section below to customize it)\n\nFor example in **_layouts/default.html**:\n\n```html\n<!-- HTML elements for search -->\n<input type=\"text\" id=\"search-input\" placeholder=\"Search blog posts..\">\n<ul id=\"results-container\"></ul>\n\n<!-- or without installing anything -->\n<script src=\"https://unpkg.com/simple-jekyll-search@latest/dest/simple-jekyll-search.min.js\"></script>\n```\n\n\n## Usage\n\nCustomize SimpleJekyllSearch by passing in your configuration options:\n\n```js\nvar sjs = SimpleJekyllSearch({\n searchInput: document.getElementById('search-input'),\n resultsContainer: document.getElementById('results-container'),\n json: '/search.json'\n})\n```\n\n### returns { search }\n\nA new instance of SimpleJekyllSearch returns an object, with the only property `search`.\n\n`search` is a function used to simulate a user input and display the matching results. \n\nE.g.:\n\n```js\nvar sjs = SimpleJekyllSearch({ ...options })\nsjs.search('Hello')\n```\n\n💡 it can be used to filter posts by tags or categories!\n\n## Options\n\nHere is a list of the available options, usage questions, troubleshooting & guides.\n\n### searchInput (Element) [required]\n\nThe input element on which the plugin should listen for keyboard event and trigger the searching and rendering for articles.\n\n\n### resultsContainer (Element) [required]\n\nThe container element in which the search results should be rendered in. 
Typically a `<ul>`.\n\n\n### json (String|JSON) [required]\n\nYou can either pass in an URL to the `search.json` file, or the results in form of JSON directly, to save one round trip to get the data.\n\n\n### searchResultTemplate (String) [optional]\n\nThe template of a single rendered search result.\n\nThe templating syntax is very simple: You just enclose the properties you want to replace with curly braces.\n\nE.g.\n\nThe template\n\n```js\nvar sjs = SimpleJekyllSearch({\n searchInput: document.getElementById('search-input'),\n resultsContainer: document.getElementById('results-container'),\n json: '/search.json',\n searchResultTemplate: '<li><a href=\"{{ site.url }}{url}\">{title}</a></li>'\n})\n```\n\nwill render to the following\n\n```html\n<li><a href=\"/jekyll/update/2014/11/01/welcome-to-jekyll.html\">Welcome to Jekyll!</a></li>\n```\n\nIf the `search.json` contains this data\n\n```json\n[\n {\n \"title\" : \"Welcome to Jekyll!\",\n \"category\" : \"\",\n \"tags\" : \"\",\n \"url\" : \"/jekyll/update/2014/11/01/welcome-to-jekyll.html\",\n \"date\" : \"2014-11-01 21:07:22 +0100\"\n }\n]\n```\n\n\n### templateMiddleware (Function) [optional]\n\nA function that will be called whenever a match in the template is found.\n\nIt gets passed the current property name, property value, and the template.\n\nIf the function returns a non-undefined value, it gets replaced in the template.\n\nThis can be potentially useful for manipulating URLs etc.\n\nExample:\n\n```js\nSimpleJekyllSearch({\n ...\n templateMiddleware: function(prop, value, template) {\n if (prop === 'bar') {\n return value.replace(/^\\//, '')\n }\n }\n ...\n})\n```\n\nSee the [tests](https://github.com/christian-fei/Simple-Jekyll-Search/blob/master/tests/Templater.test.js) for an in-depth code example\n\n### sortMiddleware (Function) [optional]\n\nA function that will be used to sort the filtered results.\n\nIt can be used for example to group the sections together.\n\nExample:\n\n```js\nSimpleJekyllSearch({\n ...\n sortMiddleware: function(a, b) {\n var astr = String(a.section) + \"-\" + String(a.caption);\n var bstr = String(b.section) + \"-\" + String(b.caption);\n return astr.localeCompare(bstr)\n }\n ...\n})\n```\n\n### noResultsText (String) [optional]\n\nThe HTML that will be shown if the query didn't match anything.\n\n\n### limit (Number) [optional]\n\nYou can limit the number of posts rendered on the page.\n\n\n### fuzzy (Boolean) [optional]\n\nEnable fuzzy search to allow less restrictive matching.\n\n### exclude (Array) [optional]\n\nPass in a list of terms you want to exclude (terms will be matched against a regex, so URLs, words are allowed).\n\n### success (Function) [optional]\n\nA function called once the data has been loaded.\n\n### debounceTime (Number) [optional]\n\nLimit how many times the search function can be executed over the given time window. This is especially useful to improve the user experience when searching over a large dataset (either with rare terms or because the number of posts to display is large). If no `debounceTime` (milliseconds) is provided a search will be triggered on each keystroke.\n\n---\n\n## If search isn't working due to invalid JSON\n\n- There is a filter plugin in the _plugins folder which should remove most characters that cause invalid JSON. 
To use it, add the simple_search_filter.rb file to your _plugins folder, and use `remove_chars` as a filter.\n\nFor example: in search.json, replace\n\n```json\n\"content\": \"{{ page.content | strip_html | strip_newlines }}\"\n```\n\nwith\n\n```json\n\"content\": \"{{ page.content | strip_html | strip_newlines | remove_chars | escape }}\"\n```\n\nIf this doesn't work when using Github pages you can try `jsonify` to make sure the content is json compatible:\n\n```js\n\"content\": {{ page.content | jsonify }}\n```\n\n**Note: you don't need to use quotes `\"` in this since `jsonify` automatically inserts them.**\n\n\n## Enabling full-text search\n\nReplace `search.json` with the following code:\n\n```yaml\n---\nlayout: none\n---\n[\n {% for post in site.posts %}\n {\n \"title\" : \"{{ post.title | escape }}\",\n \"category\" : \"{{ post.category }}\",\n \"tags\" : \"{{ post.tags | join: ', ' }}\",\n \"url\" : \"{{ site.baseurl }}{{ post.url }}\",\n \"date\" : \"{{ post.date }}\",\n \"content\" : \"{{ post.content | strip_html | strip_newlines }}\"\n } {% unless forloop.last %},{% endunless %}\n {% endfor %}\n ,\n {% for page in site.pages %}\n {\n {% if page.title != nil %}\n \"title\" : \"{{ page.title | escape }}\",\n \"category\" : \"{{ page.category }}\",\n \"tags\" : \"{{ page.tags | join: ', ' }}\",\n \"url\" : \"{{ site.baseurl }}{{ page.url }}\",\n \"date\" : \"{{ page.date }}\",\n \"content\" : \"{{ page.content | strip_html | strip_newlines }}\"\n {% endif %}\n } {% unless forloop.last %},{% endunless %}\n {% endfor %}\n]\n```\n\n\n\n## Development\n\n- `npm install`\n- `npm test`\n\n#### Acceptance tests\n\n```bash\ncd example; jekyll serve\n\n# in another tab\n\nnpm run cypress -- run\n```\n\n## Contributors\n\nThanks to all [contributors](https://github.com/christian-fei/Simple-Jekyll-Search/graphs/contributors) over the years! You are the best :)\n\n> [@daviddarnes](https://github.com/daviddarnes)\n[@XhmikosR](https://github.com/XhmikosR)\n[@PeterDaveHello](https://github.com/PeterDaveHello)\n[@mikeybeck](https://github.com/mikeybeck)\n[@egladman](https://github.com/egladman)\n[@midzer](https://github.com/midzer)\n[@eduardoboucas](https://github.com/eduardoboucas)\n[@kremalicious](https://github.com/kremalicious)\n[@tibotiber](https://github.com/tibotiber)\nand many others!\n\n## Stargazers over time\n\n[](https://starchart.cc/christian-fei/Simple-Jekyll-Search)\n"
Note: you don't need to use quotes (") here, since jsonify automatically inserts them.
Enabling full-text search
Replace search.json with the following code:
---
layout: none
---
[
,
{
"title" : "Page Not Found",
"category" : "",
"tags" : "",
"url" : "/404.html",
"date" : "",
"content" : "Sorry, but the page you were trying to view does not exist — please return to our course homepage."
} ,
{
"title" : "VMD Tutorial",
"category" : "",
"tags" : "",
"url" : "/coronavirus/VMDTutorial",
"date" : "",
"content" : "This is a short tutorial on how to use VMD to visualize molecules and perform some basic analysis. Before you start, make sure to have downloaded and installed VMD.Loading MoleculesThese steps will be on how to load molecules into VMD. We will use the example of 6vw1.Download the protein structure of 6vw1 from the protein data bank.Next we can launch VMD and load the molecule into the program. In VMD Main, navigate to File > New Molecule. Click Browse, select the molecule (6vw1.pdb) and click Load.The molecule should now be listed in VMD Main as well as the visualization in the OpenGL Display.Section to be movedGlycansFor VMD, there is no specific keyword to select glycans. A workaround is to use the keywords: “not protein and not water”. To recreate the basic VMD visualizations from the module of the open-state (6vyb) of SARS-CoV-2 Spike, use the following representations. (For the protein chains, use Glass3 for Material).The end result should look like this: Visualization Exercise Try to recreate the visualization of Hotspot31 for SARS-CoV-2 (same molecule as the tutorial). The important residues and their corresponding colors are listed on the left. "
} ,
{
"title" : "Ab initio Protein Structure Prediction",
"category" : "",
"tags" : "",
"url" : "/coronavirus/ab_initio",
"date" : "",
"content" : "Modeling ab initio structure prediction as an exploration problemPredicting a protein’s structure using only its amino acid sequence is called ab initio structure prediction (ab initio means “from the beginning” in Latin). Although many different algorithms have been developed for ab initio protein structure through the years, these algorithms all find themselves solving a similar problem.Biochemical research has contributed to the development of scoring functions called force fields that use the physicochemical properties of amino acids introduced in the previous lesson to compute the potential energy of a candidate protein shape. For a given choice of force field, we can think of ab initio structure prediction as solving the following problem: given a primary structure of a polypeptide, find its tertiary structure having minimum energy. This problem exemplifies an optimization problem, in which we are seeking an object maximizing or minimizing some function subject to constraints.The formulation of protein structure prediction as an optimization problem may not strike you as similar to anything that we have done before in this course. However, consider once more a bacterium exploring an environment for food. Every point in the bacterium’s “search space” is characterized by a concentration of attractant, and the bacterium’s goal is to reach the point of maximum attractant concentration.In the case of structure prediction, our search space is the collection of all possible conformations of a given protein, and each point in this search space represents a single conformation with an associated potential energy. Just as we imagined a ball rolling down a hill to find lower energy, we can now imagine exploring the search space of all conformations of a polypeptide to find the conformation having lowest energy. The general problem of exploring a search space to find a point minimizing some function is illustrated in the figure below, in which the height of each point represents the value of the function at that point, and our goal is to find the lowest point in the space.Optimization problems can be thought of as exploring a landscape, in which the height of a point is the value of the function that we wish to optimize. Finding the highest or lowest point in this landscape corresponds to maximizing or minimizing the function over the search space. Image courtesy: David Beamish.A local search algorithm for ab initio structure predictionNow that we have conceptualized the protein structure prediction problem as exploring a search space, we will develop an algorithm to explore this space. Our idea is to use an approach similar to E. coli’s clever exploration algorithm from a previous module: over a sequence of steps, we will consult a collection of nearby points in the space, and then move in the “direction” in which the energy function decreases the most. This approach belongs to a broad category of optimization algorithms called local search algorithms.Adapting a local search algorithm to protein structure prediction requires us to develop a notion of what it means to consider the points “nearby” a given conformation in a protein search space. Many ab initio algorithms start at an arbitrary initial conformation and then make a variety of minor modifications to that structure (i.e., nearby points in the space), updating the current conformation to the modification that produces the greatest decrease in free energy. 
These algorithms then iterate the process of progressively altering the protein structure to have the greatest decrease in potential energy. They terminate the search after reaching a structure for which no changes to the structure further reduce this energy.Yet returning to the chemotaxis analogy, imagine what happens if we were to place many small sugar cubes and one large sugar cube into the bacterium’s environment. The bacterium will sense the gradient not of the large sugar cube but of its nearest attractant. Because the smaller food sources outnumber the larger food source, the bacterium will likely not reach the point of greatest attractant concentration. In bacterial exploration, this is a feature, not a bug; if the bacterium exhausts one food source, then it will just move to another. But in protein structure prediction, we should be wary of a local search algorithm returning a protein structure that does not have minimum free energy but that does have the property that no “nearby” structures have lower energy.In general, an object in a search space that has a smaller value of the optimization function than neighboring points is called a local minimum. Returning to our landscape analogy, our search space may have many valleys, but in an optimization problem, we are seeking the lowest valley over the entire landscape, called a global minimum.STOP: How could we improve our local search algorithm for structure prediction to avoid winding up in a local minimum?Researchers applying local search algorithms have devised a number of ways to avoid local minima, two of which are so fundamental that we should mention them. First, because the algorithm’s choice of initial conformation has a huge influence on the final conformation, we should run the algorithm multiple times with different starting conformations. This is analogous to allowing multiple bacteria to explore their environment at different starting points. Second, every time we reach a local minimum, we could allow ourselves to change the structure with some probability, thus giving our local search algorithm the chance to “bounce” out of a local minimum. Once again, randomized algorithms help us solve problems!Applying an ab initio algorithm to a protein sequenceTo run an ab initio structure prediction algorithm on a real protein, we will use a software resource called QUARK, which is built upon the ideas discussed in the previous section, with some added features. For example, QUARK’s algorithm applies a combination of multiple scoring functions to look for the lowest energy conformation across all of these functions.Levinthal’s paradox means that the search space of all possible structures for a protein is so large that accurately predicting large protein structures with ab initio modeling remains very difficult. As such, QUARK limits us to proteins with at most 200 amino acids, and so we will run it only on human hemoglobin subunit alpha.Visit tutorialToward a faster approach for protein structure predictionThe figure below shows the top five predicted human hemoglobin subunit alpha structures returned by QUARK as well as the protein’s experimentally verified structure, and an average of these six structures. It takes a keen eye to see any differences between these structures. 
We conclude that ab initio prediction can be accurate.The experimentally verified protein structure of human hemoglobin subunit alpha (top left) along with five models of this protein produced by QUARK from the protein’s primary sequence, all of which are nearly indistinguishable from the verified structure with the naked eye.Yet we also wonder if we can speed up our structure prediction algorithms so that they will scale to a larger protein like the SARS-CoV-2 spike protein. In the next lesson, we will learn about another type of protein structure prediction that uses a database of known structures.Next lesson"
} ,
{
"title" : "Protein Structure Comparison",
"category" : "",
"tags" : "",
"url" : "/coronavirus/accuracy",
"date" : "",
"content" : "In this lesson, we will compare the results of the SARS-CoV-2 spike protein prediction from the previous lesson against each other and against the protein’s empirically validated structure. To do so, we need a method of comparing two structures.Comparing two shapes with the Kabsch algorithmComparing two protein structures is intrinsically similar to comparing two shapes, such as those shown in the figure below.STOP: Consider the two shapes in the figure below. How similar are they?If you think you have a good handle on comparing the above two shapes, then it is because humans have very highly evolved eyes and brains. As we will see in the next module, training a computer to detect and differentiate objects is more difficult than you think!We would like to develop a distance function d(S, T) quantifying how different two shapes S and T are. If S and T are the same, then d(S, T) should be equal to zero; the more different S and T become, the larger d should become.You may have noticed that the two shapes in the preceding figure are, in fact, identical. To demonstrate that this is true, we can first move the red shape to superimpose it over the blue shape, then flip the red shape, and finally rotate it so that its boundary coincides with the blue shape, as shown in the animation below. In general, if a shape S can be translated, flipped, and/or rotated to produce shape T, then S and T are the same shape, and so d(S, T) should be equal to zero. The question is what d(S, T) should be if S and T are not the same shape.We can transform the red shape into the blue shape by translating it, flipping it, and then rotating it.Our idea for defining d(S, T), then, is first to translate, flip, and rotate S so that it resembles T “as much as possible” to give us a fair comparison. Once we have done so, we should devise a metric to quantify the difference between the two shapes that will represent d(S, T).We first translate S to have the same center of mass (or center of mass) as T. The center of mass of S is found at the point (xS, yS) such that xS and yS are the respective averages of the x-coordinates and y-coordinates on the boundary of S. The center of mass of some shapes can be determined mathematically. But for irregular shapes, we will first sample n points from the boundary of S and then estimate xS and yS as the average of all the respective x- and y-coordinates from the sampled points.Next, imagine that we have found the desired rotation and flip of S that makes it resemble T as much as possible; we are now ready to define d(S, T) in the following way. We sample n points along the boundary of each shape, converting S and T into vectors s = (s1, …, sn) and t = (t1, …, tn), where si is the i-th point on the boundary of S. The root mean square deviation (RMSD) between the two vectors is the square root of the average squared distance between corresponding points,\[\text{RMSD}(\mathbf{s}, \mathbf{t}) = \sqrt{\dfrac{1}{n} \cdot [d(s_1, t_1)^2 + d(s_2, t_2)^2 + \cdots + d(s_n, t_n)^2]} \,.\]In this formula, d(si, ti) is the distance between the points si and ti.Note: RMSD is a common approach across data science when measuring the differences between two vectors.For an example two-dimensional RMSD calculation, consider the figure below, which shows two shapes with four points sampled from each. 
(For simplicity, the shapes do not have the same center of mass.)Two shapes with four points sampled from each.The distances between corresponding points in this figure are equal to \(\sqrt{2}\), 1, 4, and \(\sqrt{2}\). As a result, we compute the RMSD as\[\begin{align*}\text{RMSD}(\mathbf{s}, \mathbf{t}) & = \sqrt{\dfrac{1}{4} \cdot (\sqrt{2}^2 + 1^2 + 2^2 + \sqrt{2}^2)} \\& = \sqrt{\dfrac{1}{4} \cdot 9}\\& = \sqrt{\dfrac{9}{4}}\\& = \dfrac{3}{2}\end{align*}\]STOP: Do you see any issues with using RMSD to compare two shapes?Even if we assume that two shapes have already been overlapped and rotated appropriately, we still should ensure that we sample enough points to give a good approximation of how different the shapes are. Consider a circle inscribed within a square, as shown in the figure below. If we happened to sample only the four points indicated in this figure, then we would sample the same points for each shape and conclude that the RMSD between these two shapes is equal to zero. Fortunately, this issue is easily resolved by making sure to sample enough points from the shape boundaries.A circle inscribed within a square. Sampling of only the four points where the shapes intersect will give an RMSD of zero, a flawed estimate for the distance between the two shapes.However, we have still assumed that we already rotated (and possibly flipped) S to be as “similar” to T as possible. In practice, after superimposing S and T to have the same center of mass, we would like to find the flip and/or rotation of S that minimizes the RMSD between our vectorizations of S and T over all possible ways of flipping and rotating S. It is this minimum RMSD that we define as d(S, T).The best way of rotating and flipping S so as to minimize the RMSD between the resulting vectors s and t can be found with a method called the Kabsch algorithm. Explaining this algorithm requires some advanced linear algebra and is beyond the scope of our work but is described here.PDB format represents a protein’s structureThe Kabsch algorithm offers a compelling way to determine the similarity of two protein structures. We can convert a protein containing n amino acids into a vector of length n by selecting a single representative point from each amino acid. For this representative point, we typically use the alpha carbon, the amino acid’s centrally located carbon atom.Whether a protein structure is experimentally validated or predicted by an algorithm, the structure is often represented in a unified file format used by the PDB called .pdb format. In this format (see the figure below), each atom in the protein is labeled according to several different characteristics, including: the element from which the atom derives; the amino acid in which the atom is contained; the chain on which this amino acid is found; the position of the amino acid within this chain; and the 3D coordinates (x, y, z) of the atom in angstroms (10-10 meters).Lines 2,159 to 2,175 of the .pdb file for the experimentally verified SARS-CoV-2 spike protein structure, PDB entry 6vxx. These 17 lines contain information on the atoms taken from two amino acids, alanine and tyrosine. The rows corresponding to these amino acids’ alpha carbons are shown in green and appear as “CA” under the “Element” column. 
Column labels are as follows: “Index” refers to the number of the amino acid; “Element” identifies the chemical element to which this atom corresponds; “Chain” indicates which chain on which the atom is found; “Position” identifies the position in the protein of the amino acid from which the atom is taken; “Coordinates” indicates the x, y, and z coordinates of the atom’s position (in angstroms).Note: The above figure shows just part of the information needed to fully represent a protein structure. For example, a .pdb file will also contain information about the disulfide bonds between amino acids. For more information, consult the official PDB documentation).The Kabsch algorithm can be fooledAlthough the Kabsch algorithm is powerful, we should be careful when applying it. Consider the figure below, which shows two toy protein structures; the orange structure (S) is identical to the blue structure (T) except for a change in a single bond angle between the third and fourth amino acids. And yet this tiny change in the protein’s structure causes a significant increase in d(si, ti) for every i greater than 3, which inflates the RMSD.(Top) Two hypothetical protein structures that differ in only a single bond angle between the third and fourth amino acids, shown in red. Each circle represents an alpha carbon. (Bottom left) Superimposing the first three amino acids shows how much the change in the bond angle throws off the computation of RMSD by increasing the distances between corresponding alpha carbons. (Bottom right) The Kabsch algorithm would align the centers of gravity of the two structures in order to minimize RMSD between corresponding alpha carbons. This alignment belies the similarity in the structures and makes it difficult for the untrained observer to notice the proteins’ similarity.Another way in which the Kabsch algorithm can be tricked is in the case of an appended substructure that throws off the ordering of the amino acids. The following figure shows a toy example of a structure into which we incorporate a loop, thus throwing off the natural order of comparing amino acids. (The same effect is caused if one or more amino acids are deleted from one of the two proteins.)Two toy two protein structures, one of which includes a loop of three amino acids. After the loop, each amino acid in the orange structure will be compared against an amino acid that occurs farther long in the blue structure, thus increasing d(si, ti)2 for each such amino acid.To address this second issue, biologists often first align the sequences of two proteins, discarding any amino acids that do not align before performing a vectorization of structures for the RMSD calculation. We will soon see an example of a protein sequence alignment when comparing the coronavirus spike proteins.In short, if the RMSD of two proteins is large, then we should be wary of concluding that the proteins are very different, and we may need to combine RMSD with other methods of structure comparison. 
But if the RMSD is small (e.g., just a few angstroms), then we can have confidence that the proteins are indeed similar.We are now ready to apply the Kabsch algorithm to compare the structures that we predicted for human hemoglobin subunit alpha and the SARS-CoV-2 spike protein against their respective experimentally validated structures.Visit tutorialAssessing the accuracy of our structure prediction modelsIn the tutorials occurring earlier in this module, we used protein structure prediction software to predict the structure of human hemoglobin subunit alpha (using ab initio modeling) and the SARS-CoV-2 spike protein (using homology modeling). We will now see how well our models performed by showing the values of RMSD produced by the Kabsch algorithm when comparing each of these models against the validated structures.Ab initio (QUARK) models of Human Hemoglobin Subunit AlphaThe table below shows the RMSD between each of the five predicted structures returned by QUARK and the validated structure of human hemoglobin subunit alpha (PDB entry: 1si4). We are tempted to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is so short (141 amino acids), researchers would consider this RMSD score to be high. Quark Model RMSD QUARK1 1.58 QUARK2 2.0988 QUARK3 3.11 QUARK4 1.9343 QUARK5 2.6495 It is tempting to conclude that our ab initio prediction was a success. However, because human hemoglobin subunit alpha is such a short protein (141 amino acids), researchers would consider this RMSD score high.Homology models of the SARS-CoV-2 spike proteinIn the homology tutorial, we used SWISS-MODEL and Robetta to predict the structure of the SARS-CoV-2 spike protein, and we used GalaxyWeb to predict the structure of this protein’s receptor binding domain (RBD).GalaxyWEBFirst, we consider the five GalaxyWEB models produced for the spike protein RBD. The following table shows the RMSD between each of these models and the validated SARS-CoV-2 RBD (PDB entry: 6lzg). GalaxyWEB RMSD Galaxy1 0.1775 Galaxy2 0.1459 Galaxy3 0.1526 Galaxy4 0.1434 Galaxy5 0.1202 All of these models have an excellent RMSD score and can be considered very accurate. Note that their RMSD is more than an order of magnitude lower than the RMSD computed for our ab initio model of hemoglobin subunit alpha, despite the fact that the RBD is longer (229 amino acids).SWISS-MODELWe now shift to homology models of the entire spike protein and start with SWISS-MODEL. The following table shows the RMSD between each of three structures produced by SWISS-MODEL and the validated structure of the SARS-CoV-2 spike protein (PDB entry: 6vxx). SWISS MODEL RMSD SWISS1 5.8518 SWISS2 11.3432 SWISS3 11.3432 The first structure has a lowest RMSD over the three models, and even though its RMSD (5.818) is significantly higher than what we saw for the GalaxyWEB prediction for the RBD, keep in mind that the spike protein is 1281 amino acids long, and so the sensitivity of RMSD to slight changes should give us confidence that our models are on the right track.RobettaFinally, we produced five predicted structures of a single chain of the SARS-CoV-2 spike protein using Robetta. The following table compares each of them against the validated structure of the SARS-CoV-2 spike protein (PDB: 6vxx). 
Robetta RMSD Robetta1 3.1189 Robetta2 3.7568 Robetta3 2.9972 Robetta4 2.5852 Robetta5 12.0975 STOP: Which do you think performed more accurately on our predictions: SWISS-MODEL or Robetta?Distributing protein structure prediction around the worldWhile some researchers were working to elucidate the structure of the SARS-CoV-2 spike protein experimentally, thousands of users were devoting their computers to the cause of predicting the protein’s structure computationally. Two leading structure prediction projects, Rosetta@home and Folding@home, encourage volunteers to download their software and contribute to a gigantic distributed effort to predict protein shape. Even with a modest laptop, a user can donate some of their computer’s idle resources to help predict protein structure.The results of the SSGCID models of the S protein released by Rosetta@Home are shown below. SSGCID RMSD (Full Protein) RMSD (Single Chain) SSGCID1 3.505 2.7843 SSGCID2 2.3274 2.107 SSGCID3 2.12 1.866 SSGCID4 2.0854 2.047 SSGCID5 4.9636 4.6443 As we might expect due to having access to thousands of users’ computers, the SSGCID models outperform our SWISS-MODEL models. Yet a typical threshold for whether a predicted structure is accurate is if, when compared to a validated structure, its RMSD is smaller than 2.0 angstroms, a test that the models in the above table do not pass.The inability of even powerful models to obtain an accurate predicted structure for the SARS-CoV-2 spike protein may make it seem that protein structure prediction is a lost cause. Perhaps biochemists should head back to their expensive experimental validations and ignore the musings of computational scientists. In the conclusion to part 1, we will find hope.Next lesson"
} ,
{
"title" : "Methylation Helps a Bacterium Adapt to Differing Concentrations",
"category" : "",
"tags" : "",
"url" : "/chemotaxis/adaptation",
"date" : "",
"content" : "Bacterial tumbling frequencies remain constant for different attractant concentrationsIn the previous lesson, we explored the signal transduction pathway by which E. coli can change its tumbling frequency in response to a change in the concentration of an attractant. But the reality of cellular environments is that the concentration of an attractant can vary across several orders of magnitude. The cell therefore needs to detect not absolute concentrations of an attractant but rather relative changes.E. coli detects relative changes in its concentration via adaptation to these changes. If the concentration of attractant remains constant for a period of time, then regardless of the absolute value of the concentration, the cell returns to the same background tumbling frequency. That is, E. coli demonstrates robustness to the attractant concentration in maintaining its default tumbling behavior.However, our current model is not able to address this adaptation. If the ligand concentration increases in the model, then phosphorylated CheY will plummet and remain at a low steady state.In this lesson, we will investigate the biochemical mechanism that E. coli uses to achieve such a robust response to environments with different background concentrations. We will then expand the model we built in the previous lesson to replicate the bacterium’s adaptive response.Bacteria remember past concentrations using methylationRecall that in the absence of an attractant, CheW and CheA readily bind to an MCP, leading to greater autophosphorylation of CheA, which in turn phosphorylates CheY. The greater the concentration of phosphorylated CheY, the more frequently the bacterium tumbles.Signal transduction is achieved through phosphorylation, but E. coli maintains a “memory” of past environmental concentrations through a chemical process called methylation. In this reaction, a methyl group (-CH3) is added to an organic molecule; the removal of a methyl group is called demethylation.Every MCP receptor contains four methylation sites, meaning that between zero and four methyl groups can be added to the receptor. On the plasma membrane, many MCPs, CheW, and CheA molecules form an array structure. Methylation reduces the negative charge on the receptors, stabilizing the array and facilitating CheA autophosphorylation. The more sites that are methylated, the higher the autophosphorylation rate of CheA, which means that CheY has a higher phosphorylation rate, and tumbling frequency increases.We now have two different ways that tumbling frequency can be elevated. First, if the concentration of an attractant is low, then CheW and CheA freely form a complex with the MCP, and the phosphorylation cascade passes phosphoryl groups to CheY, which interacts with the flagella and keeps tumbling frequency high. Second, an increase in MCP methylation can also boost CheA autophosphorylation and lead to increased tumbling frequency.Methylation of MCPs is achieved by an additional protein called CheR. When bound to MCPs, CheR methylates ligand-bound MCPs faster12, and so the rate of MCP methylation by CheR is higher if the MCP is bound to a ligand.3. Let’s consider how this fact affects a bacterium’s behavior.Say that E. coli encounters an increase in attractant concentration. Then the lack of a phosphorylation cascade will mean less phosphorylated CheY, and so the tumbling frequency will decrease. 
However, if the attractant concentration levels off, then the tumbling frequency will flatten, while CheR starts methylating the MCP. Over time, the rising methylation will increase CheA autophosphorylation, bringing back the phosphorylation cascade and raising tumbling frequency back to default levels.Just as the phosphorylation of CheY can be reversed, MCP methylation can be undone to prevent methylation from being permanent. In particular, an enzyme called CheB, which like CheY is phosphorylated by CheA, demethylates MCPs (as well as autodephosphorylates). The rate of an MCP’s demethylation is dependent on the extent to which the MCP is methylated. In other words, the rate of MCP methylation is higher when the MCP is in a low methylation state, and the rate of demethylation is faster when the MCP is in a high methylation state.3The figure below adds CheR and CheB to provide a complete picture of the core pathways influencing chemotaxis. To model these pathways and see how our simulated bacterial system responds to different relative attractant concentrations, we will need to add quite a few molecules and reactions to our current model.The chemotaxis signal-transduction pathway with methylation included. CheA phosphorylates CheB, which methylates MCPs, while CheR demethylates MCPs. Blue lines denote phosphorylation, grey lines denote dephosphorylation, green arrows denote methylation, and red arrows denote demethlyation. Image modified from Parkinson Lab’s illustrations.Combinatorial explosion and the need for rule-based modelingTo expand our model, we will need to include methylation of the MCP by CheR and demethylation of the MCP by CheB. For simplicity, we will use three methylation levels (low, medium, and high) rather than five.Imagine that we were attempting to specify every reaction that could take place in our model. To specify an MCP, we would need to establsh whether it is bound to a ligand (two possible states), whether it is bound to CheR (two possible states), whether it is phosphorylated (two possible states), and which methylation state it is in (three possible states). Therefore, a given MCP would need 2 · 2 · 2 · 3 = 24 total states.Consider the simple reaction of a ligand binding to an MCP, which we originaly wrote as T + L → TL. We now need this reaction to include 12 of the 24 states, the ones corresponding to the MCP being unbound to the ligand. Our previously simply reaction would become, 12 different reactions, one for each possible unbound state of the complex molecule T. And if the situation were just a little more complex, with the ligand molecule L having n possible states, then we would have 12n reactions. Image trying to debug a model in which we had accidentally incorporated a type when transcribing just one of these reaction!In other words, as our model grows, with multiple different states for each molecule involved in each reaction, the number of reactions that we need to represent the system grows rapidly; this phenomenon is called combinatorial explosion and means that building realistic models of biochemical systems at scale can be daunting.Yet all these 12 reactions can be summarized with a single rule: a ligand and a receptor can bind into a complex if the receptor is unbound. Moreover, all 12 reactions implied by the rule are easily inferable from it. 
This example illustrates rule-based modeling, a paradigm applied by BioNetGen in which a potentially enormous number of reactions are specified by a much smaller collection of “rules” from which all reactions can be inferred.We will not bog down the main text with a full specification of all the rules needed to add methylation to our model while avoiding combinatorial explosion. If you are interested in the details, please follow our tutorial.Visit tutorialBacterial tumbling is robust to large sudden changes in attractant concentrationIn the figures that follow, we plot the concentration over time of each molecule for different values of l0, the initial concentration of ligand. From what we have learned about E. coli, we should see the concentration of phosphorylated CheY (and therefore the bacterium’s tumbling frequency) drop before returning to its original equilibrium. But will our simulation capture this behavior?First, we add a relatively small amount of attractant, setting l0 equal to 10,000. The system returns so quickly to an equilibrium in phosphorylated CheY that it is difficult to imagine that the attractant has had any effect on tumbling frequency.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 10,000 initial attractant ligand particles.If instead l0 is equal to 100,000, then we obtain the figure below. After an initial drop in the concentration of phosphorylated CheY, it returns to equilibrium after a few minutes.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100,000 initial attractant ligand particles.When we increase l0 by another factor of ten to 1 million, the initial drop is more pronounced, but the system returns just as quickly to equilibrium. Note how much higher the concentration of methylated receptors is in this figure compared to the previous figure; however, there is still a significant concentration of receptors with low methylation, indicating that the system may be able to handle an even larger jolt of attractant.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with one million initial attractant ligand particles.When we set l0 equal to 10 million, we give the system this bigger jolt. Once again, the model returns to its previous CheY equilibrium after a few minutes.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with ten million initial attractant ligand particles.Finally, with l0 equal to 100 million, we see what we might expect: the steepest drop in phosphorylated CheY yet, but a system that is able to return to equilibrium after a few minutes.Molecular concentrations (in number of molecules in the cell) over time (in seconds) in a BioNetGen chemotaxis simulation with 100 million initial attractant ligand particles.Our model, which is built on real reaction rate parameters, provides compelling evidence that the E. coli chemotaxis system is robust to changes in its environment across several orders of magnitude of attractant concentration. This robustness has been observed in real bacteria45, as well as replicated by other computational simulations6.Aren’t bacteria magnificent?Traveling up an attractant gradientWe have simulated E. coli adapting to a single sudden change in its environment, but life often depends on responding to continual change. 
Imagine a glucose cube in an aqueous solution. As the cube dissolves, a gradient will form, with a glucose concentration that decreases outward from the cube. How will the tumbling frequency of E. coli change if the bacterium finds itself in an environment of an attractant gradient? Will the tumbling frequency decrease continuously as well, or will we see more complicated behavior? And once the cell reaches a region of high attractant concentration, will its default tumbling frequency stabilize to the same steady state?We will modify our model by increasing the concentration of the attractant ligand exponentially and seeing how the concentration of phosphorylated CheY changes. This model will simulate a bacterium traveling up an attractant gradient toward an attractant. Moreover, we will examine how the concentration of phosphorylated CheY changes as we change the gradient’s “steepness”, or the rate at which attractant ligand is increasing. Visit the following tutorial if you’re interested in following our adjustments for yourself.Visit tutorialSteady state tumbling frequency is robustTo model a ligand concentration [L] that is increasing exponentially, we will use the function [L] = l0 · e^(k · t), where e is Euler’s number (e = 2.71828…), l0 is the initial ligand concentration, k is a constant dictating the rate of exponential growth, and t is time. The parameter k represents the steepness of the gradient, since the higher the value of k, the faster the growth in the ligand concentration [L].For example, the following figure shows the concentration over time of phosphorylated CheY (orange) when l0 = 1000 and k = 0.1. The concentration of phosphorylated CheY, and therefore the tumbling frequency, still decreases sharply as the ligand concentration increases, but after all ligands become bound to receptors (the plateau in the blue curve), receptor methylation causes the concentration of phosphorylated CheY to return to its equilibrium. For these values of l0 and k, the outcome is similar to when we provided an instantaneous increase in ligand, although the cell takes longer to reach its minimum concentration of phosphorylated CheY because the attractant concentration is increasing gradually.Plots of relevant molecule concentrations in our model (in number of molecules in the cell) over time (in seconds) when the concentration of ligand grows exponentially with l0 = 1000 and k = 0.1. The concentration of bound ligand (shown in red) quickly hits saturation, which causes a minimum in phosphorylated CheY (orange), and therefore a low tumbling frequency. To respond, the cell increases the methylation of receptors, which boosts the concentration of phosphorylated CheY back to equilibrium.The following figure shows the results of multiple simulations in which we vary the growth parameter k and plot only the concentration of phosphorylated CheY over time. The larger the value of k, the faster the increase in receptor binding, and the steeper the drop in the concentration of phosphorylated CheY.Plots of the concentration of phosphorylated CheY over time (in seconds) for different growth rates k of ligand concentration. The larger the value of k, the steeper the initial drop in the concentration of phosphorylated CheY, and the faster that methylation returns the concentration of phosphorylated CheY to equilibrium. The same equilibrium is obtained regardless of the value of k.The above figure further illustrates the robustness of bacterial chemotaxis to the rate of growth in ligand concentration. 
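If you would like to experiment with this growth function yourself, a minimal sketch of tabulating [L] = l0 · e^(k · t) for a few steepness values might look as follows (the particular values of k below are our own choices, not necessarily the ones used in the tutorial):

import math

def ligand_concentration(l0, k, t):
    """Exponentially growing ligand concentration [L] = l0 * e^(k * t)."""
    return l0 * math.exp(k * t)

l0 = 1000                       # initial ligand concentration
for k in [0.01, 0.03, 0.1]:     # illustrative gradient steepness values
    series = [ligand_concentration(l0, k, t) for t in (0, 10, 20, 30)]
    print(f"k = {k}:", ", ".join(f"{c:,.0f}" for c in series))
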
Whether the growth of the attractant is slow or fast, methylation will always bring the cell back to the same equilibrium concentration of phosphorylated CheY and therefore the same background tumbling frequency.From changing tumbling frequencies to an exploration algorithmWe hope that our work here has conveyed the elegance of bacterial chemotaxis, as well as the power of rule-based modeling and the Gillespie algorithm for simulating a complex biochemical system that may include a huge number of reactions.And yet we are missing an important part of the story. E. coli has evolved to ensure that if it detects a relative increase in concentration (i.e., an attractant gradient), then it can reduce its tumbling frequency in response. But we have not explored why changing its tumbling frequency would help a bacterium find food in the first place. After all, according to the run and tumble model, the direction that a bacterium is moving at any point in time is random!This quandary does not have an obvious intuitive answer. In this module’s conclusion, we will build a model to explain why E. coli’s randomized run and tumble walk algorithm is a clever way of locating resources in an unfamiliar land.Next lessonAdditional resourcesIf you find chemotaxis biology as interesting as we do, then we suggest the following resources. An amazing introduction to chemotaxis from the Parkinson Lab. A good overview of chemotaxis: Webre et al. 2003 A review article on chemotaxis pathway and MCPs: Baker et al. 2005. A more recent review article of MCPs: Parkinson et al. 2015. Lecture notes on robustness and integral feedback: Berg 2008. Amin DN, Hazelbauer GL. 2010. Chemoreceptors in signaling complexes: shifted conformation and asymmetric coupling. Available online ↩ Terwilliger TC, Wang JY, Koshland DE. 1986. Kinetics of Receptor Modification - the multiply methylated aspartate receptors involved in bacterial chemotaxis. The Journal of Biological Chemistry. Available online ↩ Spiro PA, Parkinson JS, and Othmer H. 1997. A model of excitation and adaptation in bacterial chemotaxis. Biochemistry 94:7263-7268. Available online. ↩ ↩2 Shimizu TS, Delalez N, Pichler K, and Berg HC. 2005. Monitoring bacterial chemotaxis by using bioluminescence resonance energy transfer: absence of feedback from the flagellar motors. PNAS. Available online ↩ Krembel A., Colin R., Sourjik V. 2015. Importance of multiple methylation sites in Escherichia coli chemotaxis. Available online ↩ Bray D, Bourret RB, Simon MI. 1993. Computer simulation of phosphorylation cascade controlling bacterial chemotaxis. Molecular Biology of the Cell. Available online ↩ "
} ,
{
"title" : "Anisotropic Network Models",
"category" : "",
"tags" : "",
"url" : "/coronavirus/anm",
"date" : "",
"content" : "ANMs account for the direction of protein fluctuationsThe generalization of a Gaussian network model, in which we attempt to determine the directions of the forces influencing alpha carbons, is called an anisotropic network model (ANM). Although ANMs include directionality, they typically perform worse than GNMs when benchmarked against experimental data1. However, ANMs can be used to create animations depicting the range of motions and fluctuations of the protein, as well as to estimate the specific directions of movement caused by each of a protein’s modes.We will not delve into the mathematical intricacies of ANM calculations, but we will use ANMs to create animations visualizing protein fluctuations. For example, click on the animation below to see a video of estimated hemoglobin fluctuations produced from ANM. We can see that the left and right side of the protein are more flexible than the rest of the protein and twist in opposite directions.Collective motions of the slowest mode in human hemoglobin from ANM calculations using DynOmics.After we produce an animation like the one in the figure above, we also should attempt to explain it biologically. Human hemoglobin exists in two states: the tense state (T), in which it is not bound to an oxygen molecule, and the relaxed state (R), in which it is oxygenated. Hemoglobin’s mobility shown in the above animation corresponds to its ability to transition between these two states, in which salt-bridges and contacts can shift by up to seven angstroms2. This significant molecular flexibility exemplifies why we need to study protein dynamics as well as structure.We will now apply ANM to the SARS-CoV and SARS-CoV-2 spike proteins. We will also use NMWiz, which is short for “normal mode wizard”, to perform ANM calculations and create an animation of the SARS-CoV-2 (chimeric) RBD and the SARS-CoV RBD.Visit tutorialNote: Although we have separated our discussion of GNM and ANM, the DynOmics 1.0 server is an effort to integrate these approaches on a user-friendly platform. If you are interested, an additional tutorial shows how to use DynOmics to replicate some of the analysis below.ANM analysis of the coronavirus binding domainThe following two animations show the complex of each virus’s RBD (purple) bound with ACE2 (green). The following tables indicate the color-coding of each animation.SARS-CoV spike protein RBD (PDB: 2ajf) SARS-CoV RBD Purple Resid 463 to 472 (Loop) Yellow Resid 442 (Hotspot 31) Orange Resid 487 (Hotspot 353) Red ACE2 Green Resid 79, 82, 83 (Loop) Silver Resid 31, 35 (Hotspot 31) Orange Resid 38, 353 (Hotspot 353) Red SARS-CoV-2 spike protein chimeric RBD (PDB: 6vw1) SARS-CoV-2 (Chimeric) RBD Purple Resid 476 to 486 (Loop) Yellow Resid 455 (Hotspot 31) Blue Resid 493 (Hotspot 31) Orange Resid 501 (Hotspot 353) Red ACE2 Green Resid 79, 82, 83 (Loop) Silver Resid 31, 35 (Hotspot 31) Orange Resid 38, 353 (Hotspot 353) Red Recall from the previous lesson that the greatest contribution of negative energy to the RBD/ACE2 complex in SARS-CoV-2 was the region called “hotspot 31”, which is highlighted in blue and orange in the above figures. If you look closely, as the protein swings in to bind with ACE2, the blue and orange regions appear to line up just a bit more naturally in the SARS-CoV-2 animation than in the SARS-CoV animation. That is, the improved binding that we hypothesized for a static structure appears to be confirmed by dynamics simulations. 
This animation provides yet one more piece of evidence that SARS-CoV-2 is more effective than SARS-CoV at binding to the ACE2 enzyme.Next lesson Yang, L., Song, G., Jernigan, R. 2009. Protein elastic network models and the ranges of cooperativity. PNAS 106(30), 12347-12352. https://doi.org/10.1073/pnas.0902159106 ↩ Davis, M., Tobi, D. 2014. Multiple Gaussian network modes alignments reveals dynamically variable regions: The hemoglobin case. Proteins: Structure, Function, and Bioinformatics, 82(9), 2097-2105. https://doi-org.cmu.idm.oclc.org/10.1002/prot.24565 ↩ "
} ,
{
"title" : "Gene Autoregulation is Surprisingly Frequent",
"category" : "",
"tags" : "",
"url" : "/motifs/autoregulation",
"date" : "",
"content" : "Using randomness to determine statistical significanceIn the previous lesson, we introduced the transcription factor network, in which a protein X is connected to a protein Y if X is a transcription factor that regulates the production of Y. We also saw that in the E. coli transcription factor network, there seemed to be a large number of loops, or edges connecting some transcription factor X to itself, and which indicate the autoregulation of X.In the introduction, we briefly referenced the notion of a network motif, a structure occurring often throughout a network. Our assertion is that the loop is a motif in the transcription factor network; how can we defend this claim?To argue that a loop is indeed a motif in the E. coli transcription factor network, we will apply a paradigm that occurs throughout computational biology (and science in general) when determining whether an observation is statistically significant. We will compare our observation against a randomly generated dataset. Without getting into the statistical details, if an observation is frequent in a real dataset, and rare in a random dataset, then it is likely to be statistically significant. Randomness saves the day again!STOP: How can we apply this paradigm of a randomly generated dataset to determine whether a transcription factor network contains a significant number of loops?Comparing a real transcription factor network against a random networkTo determine whether the number of loops in the transcription factor network of E. coli is statistically significant, we will compare this number against the expected number of loops we would find in a randomly generated transcription factor network. If the former is much higher than the latter, then we have strong evidence that some selective force is causing gene autoregulation.To generate a random network, we will use an approach developed by Edgar Gilbert in 19591. Given an integer n and a probability p (between 0 and 1), we form n nodes. Then, for every possible pair of nodes X and Y, we connect X to Y via a directed edge with probability p; that is, we simulate the process of flipping a weighted coin that has probability p of coming up “heads”.Note: Simulating a weighted coin flip amounts to generating a “random” number x between 0 and 1, and considering it “heads” if x is less than p and “tails” otherwise. For more details on random number generation, consult Programming for Lovers).STOP: What should n and p be if we are generating a random network to compare against the E. coli transcription factor network?The full E. coli transcription factor network contains thousands of genes, most of which are not transcription factors. As a result, the approach described above may form a random network that connects non-transcription factors to other nodes, which we should avoid.Instead, we will focus on the network comprising only those E. coli transcription factors that regulate each other. This network has 197 nodes and 477 edges, and so we will begin by forming a random network with n = 197 nodes.We then select p to ensure that our random network will on average have 477 edges. To do so, we note that there are n2 pairs of nodes that could have an edge connecting them (n choices for the starting node and n for the ending node). If we were to set p equal to 1/n2, then we would expect on average only to see a single edge in the random network. 
We therefore scale this value by 477 and set p equal to 477/n² ≈ 0.0123 so that we will see, on average, 477 edges in our random network.In the following tutorial, we write some code to count the number of loops in the real E. coli transcription factor network. We then build a random network and compare the number of loops found in this network against the number of loops in the real network.Visit tutorialThe negative autoregulation motifIn a random network containing n nodes, the probability that a given edge is a loop is 1/n. Therefore, if the network has e edges, then we would on average see e/n loops in the network. In our case, n is 197 and e is 477; therefore, on average, we expect to see 477/197 ≈ 2.42 loops in a random network. Yet the real network of E. coli transcription factors that regulate each other contains 130 loops!Furthermore, in a random network, we would expect that about half of the edges correspond to activation, and the other half correspond to repression. But if you followed the preceding tutorial, then you know that of the 130 loops in the E. coli network, 35 correspond to activation and 95 correspond to repression. Just as you would be surprised to flip a coin 130 times and see “heads” 95 times, the cell must be negatively autoregulating for some reason.Not only is autoregulation an important feature of transcription factors, but these transcription factors tend to negatively autoregulate. Why in the world would organisms have evolved the process of autoregulation only for a transcription factor to slow down its own transcription? In the next lesson, we will begin to unravel the mystery.Next lesson Gilbert, E.N. (1959). “Random Graphs”. Annals of Mathematical Statistics. 30 (4): 1141–1144. doi:10.1214/aoms/1177706098. ↩ "
} ,
{
"title" : "Protein Biochemistry",
"category" : "",
"tags" : "",
"url" : "/coronavirus/biochemistry",
"date" : "",
"content" : "The four levels of protein structureProtein structure is a broad term that encapsulates four different levels of description. A protein’s primary structure refers to the amino acid sequence of the polypeptide chain. The primary structure of human hemoglobin subunit alpha can be downloaded here, and the primary structure of the SARS-CoV-2 spike protein, which we showed earlier, can be downloaded here.A protein’s secondary structure describes its highly regular, repeating intermediate substructures that form before the overall protein folding process completes. The two most common such substructures, shown in the figure below, are alpha helices and beta sheets. Alpha helices occur when nearby amino acids wrap around to form a tube structure; beta sheets occur when nearby amino acids line up side-by-side.The general shape of alpha helices (left) and beta sheets (right), the two most common protein secondary structures. Source: Cornell, B. (n.d.). https://ib.bioninja.com.au/higher-level/topic-7-nucleic-acids/73-translation/protein-structure.htmlA protein’s tertiary structure describes its final 3D shape after the polypeptide chain has folded and is chemically stable. Throughout this module, when discussing the “shape” or “structure” of a protein, we are almost exclusively referring to its tertiary structure. The figure below shows the tertiary structure of human hemoglobin subunit alpha. For the sake of simplicity, this figure does not show the position of every atom in the protein but rather represents the protein shape as a composition of secondary structures.The tertiary structure of human hemoglobin subunit alpha. Within the structure are multiple alpha helix secondary structures. Source: https://www.rcsb.org/structure/1SI4.Finally, some proteins have a quaternary structure, which describes the protein’s interaction with other copies of itself to form a single functional unit, or a multimer. Many proteins do not have a quaternary structure and function as an independent monomer. The figure below shows the quaternary structure of hemoglobin, which is a multimer consisting of two alpha subunits and two beta subunits.The quaternary structure of human hemoglobin, which consists of two alpha subunits (shown in red) and two beta subunits (shown in blue). Source: https://commons.wikimedia.org/wiki/File:1GZX_Haemoglobin.png.As for coronaviruses, the spike protein is a homotrimer, meaning that it is formed of three essentially identical units called chains, each one translated from the corresponding region of the coronavirus’s genome; these chains are colored differently in the figure below. In this module, when discussing the structure of the spike protein, we often are referring to the structure of a single chain.A side and top view of the quaternary structure of the SARS-CoV-2 spike protein homotrimer, with its three chains highlighted using different colors.The structural units making up proteins are often hierarchical, and the spike protein is no exception. Each spike protein chain is a dimer, consisting of two subunits called S1 and S2. Each of these subunits further divides into protein domains, distinct structural units within the protein that fold independently and are typically responsible for a specific interaction or function. For example, the SARS-CoV-2 spike protein has a receptor binding domain (RBD) located on the S1 subunit of the spike protein that is responsible for interacting with the human ACE2 enzyme; the rest of the protein does not come into contact with ACE2. 
We will say more about the RBD soon.Proteins seek the lowest energy conformationNow that we know a bit more about how protein structure is defined, we will discuss why proteins fold in the same way every time. In other words, what are the factors driving the magic algorithm?Amino acids’ side chain variety causes them to have different chemical properties, which can lead to some protein conformations being more chemically “preferable” than others. For example, the table below groups the twenty amino acids commonly occurring in proteins according to their chemical properties. Nine of these amino acids are hydrophobic (also called nonpolar), meaning that their side chains are repelled by water, and so we tend to find these amino acids tucked away on the interior of the protein.A chart of the twenty amino acids grouped by chemical properties. The side chain of each amino acid is highlighted in blue. Image courtesy: OpenStax Biology.We can therefore view protein folding as finding the tertiary structure that is the most stable given a polypeptide’s primary structure. A central theme of the previous module on bacterial chemotaxis was that a system of chemical reactions moves toward equilibrium. The same principle is true of protein folding; when a protein folds into its final structure, it reaches a conformation that is as chemically stable as possible.To be more precise, the potential energy (sometimes called free energy) of a molecule is the energy stored within an object due to its position, state, and arrangement. In molecular mechanics, the potential energy is made up of the sum of bonded energy and non-bonded energy.As the protein bends and twists into a stable conformation, bonded energy derives from the protein’s covalent bonds, as well as the bond angles between adjacent amino acids, and the torsion angles that we introduced in the previous lesson.Non-bonded energy comprises electrostatic interactions and van der Waals interactions. Electrostatic interactions refer to the attraction and repulsion force from the electric charge between pairs of charged amino acids. Two of the twenty standard amino acids (arginine and lysine) are positively charged, and two (aspartic acid and glutamic acid) are negatively charged. Two nearby amino acids of opposite charge may interact to form a salt bridge. Conformations that contain salt bridges and keep apart similarly charged amino acids will therefore have a lower free energy component contributed by electrostatic interactions.As for van der Waals interactions, atoms are dynamic systems, with electrons constantly buzzing around the nucleus, as shown in the figure below.A carbon-12 atom showing six positively charged protons (green), six neutrally charged neutrons (blue), and six negatively charged electrons (red). Under typical circumstances, the electrons will most likely be distributed uniformly around the nucleus.However, due to random chance, the negatively charged electrons in an atom could momentarily be unevenly distributed on one side of the nucleus. This uneven distribution will cause the atom to have a temporary negative charge on the side with the excess electrons and a temporary positive charge on the opposite side. As a result of this charge, one side of the atom may attract only the oppositely charged components of another atom, creating an induced dipole in that atom in turn as shown in the figure below. 
Van der Waals forces refer to the attraction and repulsion between atoms because of induced dipoles.Due to random chance, the electrons in the atom on the left have clustered on the left side of the atom, creating a net negative charge on this side of the atom, and therefore a net positive charge on the right side of the atom. This polarity induces a dipole in the atom on the right, whose electrons are attracted because of van der Waals forces.As the protein folds, it seeks a conformation of lowest total potential energy based on the combination of all the above-mentioned forces. For an analogy, imagine a ball on a slope, as shown in the following figure. The ball will move down the slope unless it is pushed uphill by some outside force, making it unlikely that the ball will wind up at the top of a hill. We will keep this analogy in mind as we return to the problem of protein structure prediction.A ball on a hill offers an analogy for a protein folding into the lowest energy structure. As the ball is more likely to move down into a valley, a protein is more likely to fold into a more stable, lower energy conformation.Next lesson"
} ,
{
"title" : "A Biochemically Accurate Model of Bacterial Chemotaxis",
"category" : "",
"tags" : "",
"url" : "/chemotaxis/biochemistry",
"date" : "",
"content" : "Transducing an extracellular signal to a cell’s interiorWe now turn to the question of how the cell conveys the extracellular signal it has detected via the process of signal transduction to the cell’s interior. In other words, when E. coli senses an increase in the concentration of glucose, meaning that more ligand-receptor binding is taking place at the receptor that recognizes glucose, how does the bacterium change its behavior?The engine of signal transduction is phosphorylation, a chemical reaction that attaches a phosphoryl group (PO3-) to an organic molecule. Phosphoryl modifications serve as an information exchange of sorts because, as we will see, they activate or deactivate certain enzymes.A phosphoryl group usually comes from one of two sources. First, the phosphoryl can be broken off from an adenosine triphosphate (ATP) molecule, the “energy currency” of the cell, producing adenosine diphosphate (ADP). Second, the phosphoryl can be exchanged from a phosphorylated molecule that loses its phosphoryl group in a dephosphorylation reaction.For many cellular responses, including bacterial chemotaxis, a sequence of phosphorylation and dephosphorylation events called a phosphorylation cascade serves to transmit information within the cell about the amount of ligand binding being detected on the cell’s exterior. In this lesson, we discuss how this cascade of chemical reactions leads to a change in bacterial movement.A high-level view of the transduction pathway for chemotaxis is shown in the figure below. The cell membrane receptors that we have been working with are called methyl-accepting chemotaxis proteins (MCPs), and they bridge the cellular membrane, binding both to ligand stimuli in the cell exterior and to other proteins on the inside of the cell. The pathway also includes a number of additional proteins, which all start with the prefix “Che” (short for “chemotaxis”).A summary of the chemotaxis transduction pathway. A ligand binding signal is propagated through CheA and CheY phosphorylation, which leads to a response of clockwise flagellar rotation. The blue curved arrow denotes phosphorylation, the grey curved arrow denotes dephosphorylation, and the blue dashed arrow denotes a chemical interaction. Our figure is a simplified view of Parkinson Lab illustrations.On the interior of the cellular membrane, MCPs form complexes with two proteins called CheW and CheA. In the absence of MCP-ligand binding, this complex is more stable, and the CheA molecule autophosphorylates, meaning that it adds a phosphoryl group taken from ATP to itself — a concept that might seem mystical if you had not already followed our discussion of autoregulation in the previous module.A phosphorylated CheA protein can pass on its phosphoryl group to a molecule called CheY, which interacts with the flagellum in the following way. Each flagellum has a protein complex called the flagellar motor switch that is responsible for controlling the direction of flagellar rotation. The interaction of this protein complex with phosphorylated CheY induces a change of flagellar rotation from counter-clockwise to clockwise. As we discussed earlier in the module, this change in flagellar rotation causes the bacterium to tumble, which in the absence of an increase in attractant occurs every 1 to 1.5 seconds.Yet when a ligand binds to the MCP, the MCP undergoes conformation changes, which reduce the stability of the complex with CheW and CheA. 
As a result, CheA is less readily able to autophosphorylate, which means that it does not phosphorylate CheY, which cannot change the flagellar rotation to clockwise, and so the bacterium is less likely to tumble.In summary, attractant ligand binding causes less phosphorylated CheA and CheY, which means that it causes fewer flagellar interactions and therefore less tumbling, which means that it causes the bacterium to run for a longer period of time.Note: A critical part of this process is that if a cell with a high concentration of CheY detects an attractant ligand, then it needs to decrease its CheY concentration quickly. Otherwise, the cell will not be able to change its tumbling frequency. To this end, the cell is able to dephosphorylate CheY using an enzyme called CheZ.Adding phosphorylation events to our model of chemotaxisWe would like to use the Gillespie algorithm that we introduced in the previous lesson to simulate the reactions driving chemotaxis signal transduction and see what happens if the bacterium “senses an attractant”, meaning that the attractant ligand’s concentration increases and leads to more receptor-ligand binding.This model will be more complicated than any we have introduced thus far. We will need to account for both bound and unbound MCP molecules, as well as phosphorylated and unphosphorylated CheA and CheY enzymes. We will also need to model CheA phosphorylation reactions, which depend on the current concentrations of bound and unbound MCP molecules. We will at least make the simplifying assumption that the MCP receptor is permanently bound to CheA and CheW, so that we do not need to represent these molecules individually. In other words, rather than thinking about CheA autophosphorylating, we will think about the receptor that includes CheA autophosphorylating.We will need six reactions. Two reversible reactions represent ligand-receptor binding: one for phosphorylated receptors, and another for unphosphorylated receptors. Two reactions represent MCP phosphorylation and take place at different rates based on whether the MCP is bound to a ligand (in our model, the phosphorylation rate is five times greater when the MCP is unbound). One reaction represents phosphorylation of CheY, and another reaction models dephosphorylation, which is mediated by the CheZ enzyme.Once we have built this model, we would like to see what happens when we change the concentrations of the ligand. Ideally, the bacterium should be able to distinguish different ligand concentrations. That is, the higher the concentration of an attractant ligand, the lower the concentration of phosphorylated CheY, and the lower the tumbling frequency of the bacterium.But does higher attractant concentration in our model really lead to a lower concentration of phosphorylated CheY? Let’s find out by incorporating the phosphorylation pathway into our ligand-receptor model in the following tutorial.Visit tutorialChanging ligand concentrations leads to a change in internal molecular concentrationsThe top panel of the following figure shows the concentrations of phosphorylated CheA and CheY in a system at equilibrium in the absence of ligand. As we might expect, these concentrations remain at steady state (with some healthy noise), and so the cell stays at its background tumbling frequency. The addition of 5,000 attractant ligand molecules increases the concentration of bound receptors, therefore leading to less CheA autophosphorylation and less phosphorylated CheY (middle panel). 
If we instead have 100,000 initial attractant molecules, then we see an even more drastic decrease in phosphorylated CheA and CheY (bottom panel).Molecular concentrations over time (in seconds) in a chemotaxis simulation for three different initial unbound attractant ligand concentrations: no attractant ligand (top), 5,000 ligand particles (middle), and 100,000 ligand particles (bottom). Note that the simulated cell’s bound ligand concentration (green) achieves equilibrium very quickly in each case.This model, powered by the Gillespie algorithm, confirms the biological observations that an increase in attractant reduces the concentration of phosphorylated CheY. The reduction takes place remarkably quickly, with the cell attaining a new equilibrium in a fraction of a second.The biochemistry powering chemotaxis may be elegant, but it is also simple, and so perhaps it is not surprising that the model’s particle concentrations reproduced the response of E. coli to an attractant ligand.But what we have shown in this lesson is just part of the story. In the next lesson, we will see that the biochemical realities of chemotaxis are more complicated, and for good reason — this added complexity will allow E. coli, and our model of it, to react to a dynamic world with surprising sophistication.Next lesson"
} ,
{
"title" : "Biological Modeling: A Short Tour",
"category" : "",
"tags" : "",
"url" : "/buy_the_book/",
"date" : "",
"content" : "Buy the bookWe are happy to let you know that the text companion to this course, Biological Modeling: A Short Tour, is available in both print and electronic formats! Print book: Get it from Amazon. E-book: Get it from Leanpub.About the bookBiological Modeling: A Short Tour offers readers a deep but concise exploration of topics in modeling biological systems at multiple scales. Each chapter poses a single biological question, from why zebras have stripes, to how bacteria explore their world intelligently, to why the SARS-CoV-2 spike protein was so effective at binding to human cells. The book then introduces the modeling concepts needed to answer this question. Some talented students at Carnegie Mellon University who helped build the Biological Modeling project appear in the book as chapter co-authors.Biological Modeling: A Short Tour follows the core content of the course, whereas the tutorials powering this core content are hosted on this website.Crowdfunding the publication of our bookPublication of Biological Modeling: A Short Tour was graciously funded by several hundreds backers via Kickstarter and Indiegogo. We are eternally grateful to these supporters who brought the book to life."
} ,
{
"title" : "An Overview of Classification and k-Nearest Neighbors",
"category" : "",
"tags" : "",
"url" : "/white_blood_cells/classification",
"date" : "",
"content" : "The classification problem and the iris flower datasetCategorizing images of WBCs according to family is a specific instance of a ubiquitous problem in data science, in which we wish to classify each object in a given dataset into one of k groups called classes. In our ongoing example, the data are images of WBC nuclei, and the classes are the three main families of WBCs (granulocytes, lymphocytes, and monocytes). To take a different example, our data could be tumor genomes sequenced from cancer patients, which we want to classify according to the therapeutic that should be prescribed for the patient. Or the data may be the past sales behavior of shoppers, whom we wish to classify into two classes based on a prediction of whether they will buy a new product.A classical dataset commonly used for motivating classification is the iris flower dataset, which was compiled by Edgar Anderson12 and used by Ronald Fisher in a seminal paper on classification in 19363. Anderson took morphological measurements from 150 iris flowers, evenly divided over three species (see figure below). Representative images of the three species of iris included in Anderson’s iris flower dataset. Image courtesies, from left to right: Emma Forsberg, unknown author, Robert H. Mohlenbrock. Anderson measured four attributes, or features, of each flower in his dataset: the width and height of the flower’s petal, and the width and height of the flower’s sepal (a green offshoot beneath the petals). The features and species labels for twelve flowers in the iris flower dataset are shown in the table below (click here for the full dataset). Fisher noticed that flowers from the same species had similar features and wondered whether it was possible to classify the flowers according to its species using only Anderson’s four features. Sepal length (cm) Sepal width (cm) Petal length (cm) Petal width (cm) Species 5.1 3.5 1.4 0.2 I. setosa 4.9 3.0 1.4 0.2 I. setosa 4.7 3.2 1.3 0.2 I. setosa 4.6 3.1 1.5 0.2 I. setosa 7.0 3.2 4.7 1.4 I. versicolor 6.4 3.2 4.5 1.5 I. versicolor 6.9 3.1 4.9 1.5 I. versicolor 5.5 2.3 4.0 1.3 I. versicolor 6.3 3.3 6.0 2.5 I. virginica 5.8 2.7 5.1 1.9 I. virginica 7.1 3.0 5.9 2.1 I. virginica 6.3 2.9 5.6 1.8 I. virginica A table containing values of the four features for twelve members of the iris flower dataset. The complete dataset was accessed from the University of California, Irvine Machine Learning Repository].STOP: What are typical feature values for flowers from each species in the table above? If presented with an iris of unknown species, how could you use these features to classify it?From flowers to vectorsIf we were to use only two of the four features in the iris flower dataset, then a flower’s feature values x and y could be represented as a point in two-dimensional space (x, y). The figure below shows such a plot for the features of petal length (x-axis) and petal width (y-axis).Petal length (x-axis) plotted against width (y-axis) for each of the flowers in the iris flower dataset, with data points colored by species. Although there were fifty flowers from each species, there are not fifty points corresponding to every species because some flowers have the same petal length and width and therefore occupy the same point.Note how stark the pattern in the above figure is. 
Even though we chose only two features from the iris flowers, the points associated with the flowers mostly divide into three main clusters by species.If we were to use all four features for the iris dataset, then every flower would be represented by a point in four-dimensional space. For example, the first flower in our initial table of iris features would be represented by the point (5.1, 3.5, 1.4, 0.2). In general, when classifying a collection of data with n features, each element in the dataset can be represented by a feature vector of length n, whose i-th value corresponds to the value of the data point’s i-th feature.Classifying unknown elements with k-nearest neighborsOur hope is that for datasets other than the iris flower dataset, elements from the same class will have feature vectors that are nearby in n-dimensional space. If so, then we can classify a data point whose class is unknown by determining which data points with known classification it is near.STOP: Consider the gray point with unknown class in the figure below. Should it be assigned to the class of the green points or to the class of the blue points?An unknown point (gray) along with a collection of nearby points belonging to two classes, colored green and blue.The preceding question indicates that classifying points can be surprisingly open-ended. Because of this freedom, researchers have devised a variety of different approaches for classifying data given data with known classes.We will discuss a simple but powerful classification algorithm called k-nearest neighbors, or k-NN4. In k-NN, we fix a positive integer k in advance. Then, for each point with unknown class, we assign it to the class possessed by the largest number of its k closest neighbors.In the ongoing example, if we were using k equal to 1, then we would assign the unknown point to the green class (see figure below).When k is equal to 1, k-NN classifies an unknown point according to the point of known class that is nearest; for this reason, the gray point above with unknown class would be assigned to the green class.However, with the same data and k equal to 4, the figure below shows that a majority of the k nearest neighbors are blue, and so we classify the unknown point as blue. This example reinforces a theme of this course, that the results of an algorithm can be sensitive to our choice of parameters.When using k-NN with k equal to 4, k-NN classifies the unknown point as blue, since three of its four closest neighbors are blue.STOP: When k is equal to 2 or 6 for the ongoing example, we obtain a tie in the number of points from each known class belonging to the k nearest neighbors of a point with unknown class. How could we break ties in k-NN?In the more general case in which feature vectors have n coordinates, we can determine which points are nearest to a given point by using the Euclidean distance, which generalizes the distance formula between vectors in three-dimensional space to the case of n-dimensional vectors. The Euclidean distance between vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is given by the square root of the sum of squares of differences between corresponding vector elements:\[d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n-y_n)^2}\,.\]We now have learned how to use k-NN to classify feature vectors with unknown classes given vectors with known classes. There is just one problem: how can we convert an image of a WBC nucleus into a vector?Next lesson Anderson E (1935) The irises of the Gaspe Peninsula. 
Bulletin of the American Iris Society 59: 2-5. ↩ Anderson E (1936) The species problem in Iris. Annals of the Missouri Botanical Garden 23(3):457-509. Available online ↩ Fisher RA (1936) The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7(2):179-188. Available online ↩ Fix E. and Hodges J.L. (1951) Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field. Available online ↩ "
} ,
{
"title" : "Conclusion: The Robustness of Biological Oscillators",
"category" : "",
"tags" : "",
"url" : "/motifs/conclusion",
"date" : "",
"content" : "Biological oscillators must be robustIf your heart skips a beat when you are watching a horror movie, then it should return quickly to its natural rhythm. When you hold your breath to dive underwater, your normal breathing resumes at the surface. And regardless of what functions your cells perform, they should be able to maintain a normal cell cycle despite disturbance.An excellent illustration of oscillator robustness is the body’s ability to handle jet lag. There is no apparent reason why we would have evolved to fly halfway around the world in a few hours. And yet our circadian clock is so resilient that after a few days of fatigue and crankiness, it returns us to a normal daily cycle.In the previous lesson, we saw that the repressilator, a three-element motif, can exhibit oscillations even in a noisy environment of randomly moving particles. The repressilator’s resilience makes us wonder how well it can respond to a disturbance in the concentrations of its particles.A coarse-grained repressilator modelWith our work on Turing patterns in the prologue, tracking the movements of many individual particles led to a slow simulation that did not scale well given more particles or reactions. This observation led us to devise a coarse-grained reaction-diffusion model that was still able to produce Turing patterns. We used a cellular automaton because the concentrations of particles varied in different locations and were diffusing at different rates.We would like to devise a coarse-grained model of the repressilator. However, the particles diffuse at the same rate and are uniform across the simulation, and so there is no need to track concentrations in individual locations. As a result, we will use a simulation that assumes that particles are well-mixed.For example, say that we are modeling a degradation reaction. If we start with a concentration of 10,000 X particles, then after a single time step, we will simply multiply the number of X particles by (1-k), where k is a parameter related to the rate of the degradation reaction.As for a repression reaction like X + Y → X, we can update the concentration of Y particles by subtracting some factor r times the current concentrations of X and Y particles.We will further discuss the technical details required to implement a well-mixed reaction-diffusion model in the next module. In the meantime, we would like to see what happens when we make a major disturbance to the concentration of one of the particles in the well-mixed model. Do particle concentrations resume their oscillations? To build this model of the repressilator, check out the following tutorial.Visit tutorialThe repressilator is robust to disturbanceThe figure below shows plots over time of particle concentrations in our well-mixed simulation of the repressilator. (Note that these plots are less noisy than the ones that we produced previously because we are assuming a well-mixed environment.) Midway through this simulation, we greatly increase the concentration of Y particles.A plot of particle concentrations in the well-mixed repressilator model over time. Adding a significant number of Y particles to our simulation (the second blue peak) produces little ultimate disturbance to the concentrations of the three particles, which return to normal oscillations within a single cycle.Because of the spike in the concentration of Y, the reaction Y + Z → Y suppresses the concentration of Z for longer than usual, and so the concentration of X is free to increase for longer than normal. 
As a result, the next peak of X particles is higher than normal.We might hypothesize that this process would continue, with a tall peak in the concentration of Z. However, the peak in the concentration of Z is no taller than normal, and the following peak shows a normal concentration of Y. The system has very quickly absorbed the blow of an increase in concentration of Y and returned to normal within one cycle.Even with a much larger jolt to the concentration of Y, the concentrations of the three particles return to normal oscillations very quickly, as shown below.A larger increase in the concentration of Y particles than in the previous figure does not produce a substantive change in the system.The repressilator is not the only network motif that leads to oscillations of particle concentrations, but robustness to disturbance is a shared feature of all these motifs. Furthermore, the repressilator is not the most robust oscillator that we can build. Researchers have shown that at least five components are typically needed to build a very robust oscillator,1 which may help explain why real oscillators tend to have more than three components.In the next module, we will encounter a much more involved biochemical process, with far more molecules and reactions, that is used by bacteria to cleverly (and robustly) explore their environment. In fact, we will have so many particles and so many reactions that we will need to completely rethink how we formulate our model.In the meantime, check out the exercises below to continue building your understanding of transcription factor networks and network motifs.Visit exercisesNext module Castillo-Hair, S. M., Villota, E. R., & Coronado, A. M. (2015). Design principles for robust oscillatory behavior. Systems and Synthetic Biology, 9(3), 125–133. https://doi.org/10.1007/s11693-015-9178-6 ↩ "
} ,
{
"title" : "Conclusion: Toward Deep Learning",
"category" : "",
"tags" : "",
"url" : "/white_blood_cells/conclusion",
"date" : "",
"content" : "A brief introduction to artificial neuronsThe best known classification algorithms for WBC image analysis1 use a technique called deep learning. You have probably seen this term wielded with reverence, and so in this chapter’s conclusion, we will briefly explain what it means and how it can be applied to classification.Neurons are cells in the nervous system that are electrically charged and that use this charge as a method of communication to other cells. As you are reading this text, huge numbers of neurons are firing in your brain as it processes the visual information that it receives. The basic structure of a neuron is shown in the figure below. The components of a neuron. Electrical signals are passed down axons and exchanged at terminal boundaries between neurons. Image courtesy: Jennifer Walinga.In 1943, Warren McCulloch (a neuroscientist) and Walter Pitts (a logician) devised an artificial model of a neuron that is now called a McCulloch-Pitts neuron.2 A McCulloch-Pitts neuron has a fixed threshold b and takes as input n binary variables x1, …, xn, where each of these variables is equal to either 0 or 1. The neuron outputs 1 if x1 + … + xn ≥ b; otherwise, it returns 0. If a McCulloch-Pitts neuron outputs 1, then we say that it fires.The McCulloch-Pitts neuron with n equal to 2 and b equal to 2 is shown in the figure below. The only way that this neuron will fire is if both inputs x1 and x2 are equal to 1. A McCullough-Pitts neuron with n equal to 2 and *b* equal to 2. The neuron fires when x1 + x2 is at least equal to *b*, which occurs precisely when both input variables are equal to 1; if either input variable is equal to 0, then x1 + x2 will be less than *b* and the neuron will not fire (i.e., it will output 0).Note: The mathematically astute reader may have noticed that the output of the McCulloch-Pitts neuron in the figure above is identical to the logical proposition x1 AND x2, which explains why these neurons started as a collaboration between a neuroscientist and a logician.In 1958, Frank Rosenblatt generalized the McCulloch-Pitts neuron into a perceptron.3 This artificial neuron also has a threshold constant b and n binary input variables xi, but it also includes a collection of real-valued constant weights wi that are applied to each input variable. That is, the neuron will output 1 (fire) when the weighted sum w1 · x1 + w2 · x2 + … + wn · xn is greater than or equal to b.Note: A McCulloch-Pitts neuron is a perceptron for which all of the wi are equal to 1.For example, consider the perceptron shown in the figure below; we assign the weight wi to the edge connecting input variable xi to the neuron. A perceptron with two input variables. The perceptron includes a constant threshold and constant weights w1 and w2. The perceptron outputs 1 when the weighted sum w1 · x1 + w2 · x2 is greater than or equal to *b*, and it outputs 0 otherwise.The modern concept of an artificial neuron, as shown in the figure below, generalizes the perceptron further in two ways. First, the input variables xi can have arbitrary decimal values (often, these inputs are constrained to be between 0 and 1). Second, rather than the neuron rigidly outputting 1 when w1 · x1 + w2 · x2 + … + wn · wn is greater than or equal to b, we subtract b from the weighted sum and pass the resulting value into a function f called an activation function; that is, the neuron outputs f(w1 · x1 + w2 · x2 + … + wn · wn). In this form of the neuron, b is called the bias of the neuron. 
A general form of an artificial neuron for two input variables x1 and x2, two constant weights w1 and w2, a constant bias b, and an activation function f. The output of the neuron, rather than being strictly 0 or 1, is f(w1 · x1 + w2 · x2 - b).A commonly used activation function is the logistic function, f(x) = 1/(1 + e^(-x)), shown in the figure below. Note that the output of this function ranges between 0 (when its input is very negative) and 1 (when its input is very positive). A plot of the logistic function f(x) = 1/(1 + e^(-x)), an increasing function whose values range between 0 and 1.STOP: Because of its simplicity, researchers now often use a “rectifier” activation function: f(x) = max(0, x). What does the graph of this function look like? What is the activation function used by a perceptron, and how does it differ from the rectifier function?Framing a classification problem using neural networksThe outputs of mathematical functions can be used as inputs to other functions via function composition. For example, if f(x) = 2x-1, g(x) = e^x, and h(x) = cos(x), then h(g(f(x))) = cos(e^(2x-1)). Similarly, we can use artificial neurons as building blocks by linking them together, with the outputs of some neurons serving as inputs to other neurons. Linking neurons in this way produces a neural network such as the one shown in the figure below, which we will take time to explain. An illustration of a potential neural network used for WBC image classification. This network assumes that each WBC is represented by n features, which serve as the input variables for the network. A number of hidden layers of additional neurons may be used, with connections between some of the neurons in adjacent layers. A final output layer of three neurons corresponds to each of the three WBC classes; our hope is that the weights of the neurons in the network are chosen so that the appropriate neuron outputs a value close to 1 corresponding to an image’s class, and that the other two neurons output values close to 0.We have discussed converting each object x in a dataset (such as our WBC image dataset) into a feature vector (x1, x2, …, xn) representing each of the n feature values of x. In the figure above, these n variables of the data’s feature vectors become the n input variables of the neural network.We typically then connect all the input variables to most or all of a collection of (possibly many) artificial neurons, called a hidden layer, which is shown for simplicity as a gray box in the figure above. If we have m artificial neurons in the hidden layer and n input variables, then we will have m bias constants as well as m · n weights, each one assigned to an edge connecting an input variable to a hidden layer neuron (all these edges are indicated by dashed edges in the figure above). Our model has quickly accumulated an enormous number of parameters!The first hidden layer of neurons may then be connected as inputs to neurons in another hidden layer, which are connected to neurons in another layer, and so on. As a result, practical neural networks may have several hidden layers with thousands of neurons, each with their own biases, input weights, and even different activation functions. The most common usage of the term deep learning refers to solving problems using neural networks having several hidden layers; the discussion of the many challenges in designing neural networks for deep learning would fill an entire course.The remaining question is what the output of our neural network should be. 
If we would like to apply the network to classify our data into k classes, then we typically will connect the final hidden layer of neurons to k output neurons. Ideally, if we know that a data point x belongs to the i-th class, then when we use the values of its feature vector as input to the network, we would like for the i-th output neuron to output a value close to 1, and for all other output neurons to output a value close to 0. For a neural network to correctly classify objects in our dataset, we must find such an ideal choice for the biases of each neuron and the weights assigned to input variables at each neuron — assuming that we have decided on which activation function(s) to use for the network’s neurons. We will now define quantitatively what makes a given choice of neural network parameters suitable for classification.STOP: Say that a neural network has 100 input variables, three output neurons, and four hidden layers of 1000 neurons each. Say also that every neuron in one layer is connected as an input to every neuron in the next layer. How many bias parameters will this network have? How many weight parameters will this network have?Defining the best choice of parameters for a neural networkIn a previous lesson, we discussed how to assess the results of a classification algorithm like k-NN on a collection of data with known classes. To generalize this idea to our neural network classifier, we divide our data that have known classes into a training set and a test set, where the former is typically much larger than the latter. We then seek the choice of parameters for the neural network that “performs the best” on the training set, which we will now explain.Each data point x in the training set has a ground truth classification vector c(x) = (c1, c2, …, ck), where if x belongs to class j, then cj is equal to 1, and the other elements of c(x) are equal to 0. The point x also has an output vector o(x) = (o1, o2, …, ok), where for each i, oi is the output of the i-th output neuron in the network when x is used as the network’s input. The neural network is doing well at identifying the class of x when the classification vector c(x) is similar to the output vector o(x).Fortunately, we have been using a method of comparing two vectors throughout this book. The RMSD between c(x) and o(x) measures how well the network classified data point x, with a value close to 0 representing a good fit. We can obtain a good measure of how well a neural network with given weight and bias parameters performs on a training set on the whole by taking the average RMSD between classification and output vectors over every element in the training set. We therefore would like to choose the biases and input weights for the neural network that minimize this average RMSD for all objects in the training set.Once we have chosen a collection of bias and weight parameters for our network that perform well on the training set, we then assess how well these parameters perform on the test set. To this end, we can insert the feature vector of each test set object x as input into the network and consult the output vector o(x). Whichever i maximizes oi for this output vector becomes the assigned class of x. 
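The bookkeeping just described can be sketched in a few helper functions. Here network is a hypothetical stand-in for any function that maps a feature vector to a k-element output vector, and the helper names are ours rather than the lesson's.

```python
# A minimal sketch of the mean-RMSD objective and of test-set class assignment.

import math

def rmsd(u, v):
    """Root-mean-square deviation between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def mean_rmsd(network, training_set):
    """Average RMSD between output vectors o(x) and classification vectors c(x).
    training_set is a list of (feature_vector, classification_vector) pairs."""
    return sum(rmsd(network(x), c) for x, c in training_set) / len(training_set)

def assign_class(network, x):
    """Assigned class = index i of the output neuron with the largest value oi."""
    outputs = network(x)
    return max(range(len(outputs)), key=lambda i: outputs[i])
```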
We can then use the metrics introduced previously in this module for quantifying the quality of a classifier to determine how well the neural network performs at classifying objects from the test set.This discussion has assumed that we can easily determine the best choice of network parameters to produce a low mean RMSD between output and classification vectors for the training set. But how can we find this set of parameters in the first place?Exploring a neural network’s parameter spaceThe typical neural network contains anywhere from thousands to billions of biases and input weights. We can think of these parameters as forming the coordinates of a vector in a high-dimensional space. From the perspective of producing low mean RMSD between output and classification vectors over a collection of training data, the vast majority of the points in this space (i.e., choices of network parameters) are worthless. In this vast landscape, a tiny number of these parameter choices will provide good results on our training set; even with substantial computational resources, finding one of these points is daunting.The situation in which we find ourselves is remarkably similar to one we have encountered throughout this course, in which we need to explore a search space for some object that optimizes a function. We would therefore like to design a local search algorithm to explore a neural network’s parameter space.As with ab initio structure prediction, we could start with a random choice of parameters, make a variety of small changes to the parameter values to obtain a set of “nearby” parameter vectors, and update our current parameter vector to the parameter vector from this set that produces the greatest decrease in mean RMSD between output and classification vectors. We then continue this process of making small changes to the current parameter vector until this mean RMSD stops decreasing. This local search algorithm is similar to the most popular approach for determining parameters for a neural network, called gradient descent.STOP: What does a local minimum mean in the context of neural network parameter search?Just as we run ab initio structure prediction algorithms using many different initial protein conformations, we should run gradient descent for many different sets of randomly chosen initial parameters. In the end, we take the choice of parameters minimizing mean RMSD over all these trials.Note: If you find yourself interested in deep learning and would like to learn more, check out the excellent Neural Networks and Deep Learning online book by Michael Nielsen.Neural network pitfalls, Alphafold, and final reflectionsNeural networks are wildly popular, but they have their own issues. Because we have so much freedom for how the neural network is formed, it is challenging to know a priori how to design an “architecture” for how the neurons should be connected to each other for a given problem.Once we have decided on an architecture, the neural network has so many bias and weight parameters that even with access to a supercomputer, it may be difficult to find values for these parameters that perform even reasonably well on the training set; the situation of having parameters with high RMSD for the training set is called “underfitting”. 
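As a concrete illustration of the local search described above (many random starting points, small perturbations, keep the best nearby parameter vector until the mean RMSD stops decreasing), here is a rough sketch before we continue with the pitfalls below. mean_rmsd_for is a hypothetical function scoring a parameter vector on the training set, and the step size and counts are illustrative assumptions.

```python
# A rough sketch of local search over a neural network's parameter space.
# This is the simple hill-descending idea from the text, not the gradient
# descent used in practice, and all numeric settings are illustrative.

import random

def local_search(mean_rmsd_for, num_params, step=0.01, num_neighbors=20, restarts=5):
    best_params, best_score = None, float("inf")
    for _ in range(restarts):
        # Start from a random choice of parameters.
        params = [random.uniform(-1, 1) for _ in range(num_params)]
        score = mean_rmsd_for(params)
        while True:
            # Make a set of "nearby" parameter vectors by small random changes.
            neighbors = [[p + random.uniform(-step, step) for p in params]
                         for _ in range(num_neighbors)]
            candidate = min(neighbors, key=mean_rmsd_for)
            candidate_score = mean_rmsd_for(candidate)
            if candidate_score >= score:      # mean RMSD stopped decreasing
                break
            params, score = candidate, candidate_score
        if score < best_score:                # keep the best result over all restarts
            best_params, best_score = params, score
    return best_params, best_score
```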
Even if we build a neural network having low mean RMSD for the training set, the neural network may perform horribly on the test set, which is called “overfitting” and offers yet another instance of the curse of dimensionality.Despite these potential concerns with neural networks, they are starting to show promise of making significant progress in solving biological problems. AlphaFold, which we introduced when discussing protein folding, is powered by neural networks that contain 21 million parameters. Yet although AlphaFold has revolutionized the study of protein folding, just as many problems exist for which neural networks are struggling to make progress over existing methods. Biology largely remains, like the environment of a lonely bacterium, an untouched universe waiting to be explored.Thank you!If you are reading this, and you’ve made it through our entire course, thank you for joining us on this journey! We are grateful that you gave your time to us, and we wish you the best on your educational journey. Please don’t hesitate to contact us if you have any questions, feedback, or would like to leave us a testimonial; we would love to hear from you.Visit exercises Habibzadeh M, Jannesari M, Rezaei Z, Baharvand H, Totonchi M. Automatic white blood cell classification using pre-trained deep learning models: ResNet and Inception. Proc. SPIE 10696, Tenth International Conference on Machine Vision (ICMV 2017), 1069612 (13 April 2018). Available online ↩ McCulloch WS, Pitts W 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5: 115–133. Available online ↩ Rosenblatt F 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 (6): 386. ↩ "
} ,
{
"title" : "Conclusion: Turing Patterns are Fine-Tuned",
"category" : "",
"tags" : "",
"url" : "/prologue/conclusion",
"date" : "",
"content" : "In both the particle-based and automaton model for Turing patterns, we observed that the model is fine-tuned, meaning that very slight changes in parameter values can lead to significant changes in the system. These changes could convert spots to stripes, or they could influence how clearly defined the boundaries of the Turing patterns are.The figure below shows how the Turing patterns produced by the Gray-Scott model change as the kill and feed rates vary. The square at position (x, y) shows the pattern obtained as the result of a Gray-Scott simulation with kill rate x and feed rate y. Notice how much the patterns change! You may like to tweak the parameters of the Gray-Scott simulation from the previous lesson to see if you can reproduce these differing patterns.Changing kill (x-axis) and feed (y-axis) parameters greatly affects the Turing patterns obtained in the Gray-Scott model. Each small square shows the patterns obtained from a given choice of feed and kill rate. Note that many choices of parameters do not produce Turing patterns, which only result from a narrow “sweet spot” band of parameter choices. Image courtesy: Robert Munafo.1Later in this course, we will see an example of a biological system that is the opposite of fine-tuned. In a robust system, perturbations such as parameter variations do not lead to substantive changes in the ultimate behavior of the system. Robustness is vital for processes, like your heartbeat, that must be resilient to small environmental changes.It turns out that although Turing’s work offers a compelling argument for how zebras might have gotten their stripes, the cellular mechanism causing these stripes to form remains unidentified. However, researchers have shown that the skin of zebrafish does exhibit Turing patterns because two types of pigment cells serve as “particles” following a reaction-diffusion model much like the one we presented in this prologue.2Finally, take a look at the following two photos of giant pufferfish.34 Genetically, these fish are practically identical, but their skin patterns are very different. What may seem like a drastic change from spots to stripes is likely attributable to a small change of parameters in a fine-tuned biological system that, like much of life, is powered by randomness. Two similar pufferfish with very different skin Turing patterns. (Left) A juvenile Mbu pufferfish with spotted skin. (Right) An adult Mbu pufferfish with striped skin. A final noteThank you for making it this far! We hope that you are enjoying the course. You can visit the exercises for this module or skip ahead to the next module by clicking on the appropriate button below. We also ask that you complete the course survey if you have not done so already.Visit exercisesNext module “Reaction-Diffusion by the Gray-Scott Model: Pearson’s Parametrization” © 1996-2020 Robert P. Munafo https://mrob.com/pub/comp/xmorphia/index.html ↩ Nakamasu, A., Takahashi, G., Kanbe, A., & Kondo, S. (2009). Interactions between zebrafish pigment cells responsible for the generation of Turing patterns. Proceedings of the National Academy of Sciences of the United States of America, 106(21), 8429–8434. https://doi.org/10.1073/pnas.0808622106 ↩ NSG Coghlan, 2006 Creative Commons Attribution-Share Alike 3.0 Unported ↩ Chiswick Chap, 20 February 2012, Creative Commons Attribution-Share Alike 3.0 Unported ↩ "
} ,
{
"title" : "Conclusion: The Beauty of *E. coli*'s Robust Randomized Exploration Algorithm",
"category" : "",
"tags" : "",
"url" : "/chemotaxis/conclusion",
"date" : "",
"content" : "Two randomized exploration strategiesIn the prologue, we saw that a particle taking a collection of n unit steps in random directions will wind up on average a distance proportional to \(\sqrt{n}\) units away from its starting position. We now will compare such a random walk against a modified algorithm that emulates the behavior of E. coli by changing the length of a step (i.e., how long the bacterium tumbles) based on the relative change in background attractant concentration.We will represent a bacterium as a particle traveling in two-dimensional space. Units of distance will be measured in µm; recall from the introduction that a bacterium can cover 20 µm in a second during an uninterrupted run. The bacterium will start at the origin (0, 0).We will use L(x, y) to denote the ligand concentration at (x, y) and establish a point (called the goal) at which L(x, y) is maximized. We will place the goal at (1500, 1500), so that the bacterium must travel a significant distance from the origin to reach the goal.We would like the ligand concentration L(x, y) to decrease exponentially the farther we travel from the goal. We therefore set L(x, y) = 100 · 106 · (1-d/D), where d is the distance from (x, y) to the goal, and D is the distance from the origin to the goal, which in this case is 1500√2 ≈ 2121 µm. At the origin, the attractant concentration is equal to 100, and at the goal, the attractant concentration is equal to 100,000,000.STOP: How can we quantify how well a bacterium has done at finding the attractant?We are comparing two different cellular behaviors, and so in the spirit of Module 1, we will simulate many random walks of a particle following each of the two strategies, described in what follows. (The total time needed by our simulation should be large enough to allow the bacterium to have enough time to reach the goal.) For each strategy, we will then measure how far on average a bacterium with each strategy is from the goal at the end of the simulation.Strategy 1: Standard random walkTo model a particle following an “unintelligent” random walk strategy, we first select a random direction of movement along with a duration of tumble. The angle of reorientation is a random number selected uniformly between 0° and 360°. The duration of each tumble is a “wait time” of sorts and follows an exponential distribution with experimentally verified mean equal to 0.1 seconds1. As the result of a tumble, the cell only changes its orientation, not its position.We then select a random duration to run and let the bacterium run in that direction for the specified amount of time. The duration of each run follows an exponential distribution with mean equal to the experimentally verified value of 1 second.We then iterate the two steps of tumbling and running until the total time allocated for the simulation has elapsed.In the following tutorial, we simulate this naive strategy using a Jupyter notebook that will also help us visualize the results of the simulation.Visit tutorialStrategy 2: Chemotactic random walkIn our second strategy, which we call the “chemotactic strategy”, we mimic the real response of E. coli to its environment based on what we have learned about chemotaxis throughout this module. 
The simulated bacterium will still follow a run and tumble model, but the duration of each run, which is a function of the bacterium’s tumbling frequency, will depend on the relative change in attractant concentration that it detects.To ensure a mathematically controlled comparison, we will use the same approach for sampling the duration of a tumble and the direction of a run as in the first strategy.We have seen in this module that it takes E. coli about half a second to respond to a change in attractant concentration. We use tresponse to denote this “response time”; to produce a reasonable model of chemotaxis, we will check the attractant concentration of a running particle at the particle’s current location every tresponse seconds.We will then measure the percentage difference between the attractant concentration L(x, y) at the cell’s current point and the attractant concentration at the cell’s previous point, tresponse in the past; we denote this difference as Δ[L]. If Δ[L] is equal to zero, then the probability of a tumble in the next tresponse seconds should be the same as the likelihood of a tumble in the first strategy over the same time period. If Δ[L] is positive, then the probability of a tumble should be lower than it was in strategy 1; if Δ[L] is negative, then the probability of a tumble should be greater than it was in strategy 1.To model the relationship between the likelihood of a tumble and the value of Δ[L], we will let t0 denote the mean background run duration, which in the first strategy was equal to one second. We would like to use a simple formula for the expected run duration, such as t0 · (1 + 10 · Δ[L]).Unfortunately, there are two issues with this formula. First, if Δ[L] is less than -0.1, then the run duration could be negative. Second, if Δ[L] is large, then the bacterium will run for so long that it could reach the goal and run past it.To fix the first issue, we will first take the maximum of t0 · (1 + 10 · Δ[L]) and some small positive number c (we will use c equal to 0.000001). As for the second issue, we will then take the minimum of the resulting expression and 4 · t0. The final value, μ = min(max(t0 · (1 + 10 · Δ[L]), c), 4 · t0), becomes the mean run duration of a bacterium based on the recent relative change in concentration.STOP: What is the mean run duration μ when Δ[L] is equal to zero? Is this what we would hope?As with the first strategy, our simulated cell will alternate between tumbling and running in a random direction until the total time devoted to the simulation has elapsed. The only difference in the second strategy is that we will measure the percentage change in concentration Δ[L] between a cell’s current point and its previous point every tresponse seconds. After determining a mean run time μ according to the expression above, we will sample a random number p from an exponential distribution with mean μ, and the cell will tumble after p seconds if p is smaller than tresponse.In the following tutorial, we will adapt the Jupyter notebook that we built in the previous tutorial to simulate this second strategy and run it many times, taking the average of the simulated bacteria’s distance to the goal.Visit tutorialComparing the effectiveness of our two random walk strategiesThe following figure visualizes the trajectories of three cells over 500 seconds using strategy 1 (left) and strategy 2 (right) with a default mean run duration t0 of one second. 
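Before turning to that figure, the rule above for choosing the mean run duration and deciding whether the cell tumbles in the next response window can be sketched in a few lines; t0, c, and tresponse follow the values stated in the text, and the helper names are illustrative.

```python
# A minimal sketch of the chemotactic rule: mu = min(max(t0*(1 + 10*delta_L), c), 4*t0),
# where delta_L is the relative change in attractant over the last response window.

import random

T0 = 1.0            # mean background run duration, in seconds
C = 0.000001        # small positive floor on the run duration
T_RESPONSE = 0.5    # how often the cell re-measures the attractant, in seconds

def mean_run_duration(delta_L, t0=T0, c=C):
    """Clamp t0 * (1 + 10 * delta_L) between c and 4 * t0."""
    return min(max(t0 * (1.0 + 10.0 * delta_L), c), 4.0 * t0)

def tumbles_in_next_window(delta_L, t_response=T_RESPONSE):
    """Sample p from an exponential distribution with mean mu; tumble if p < t_response."""
    mu = mean_run_duration(delta_L)
    p = random.expovariate(1.0 / mu)
    return p < t_response

# Rising attractant (delta_L > 0) lengthens runs, so a tumble becomes less likely.
print(mean_run_duration(0.0))    # 1.0, matching strategy 1
print(mean_run_duration(0.3))    # 4.0, capped at 4 * t0
print(mean_run_duration(-0.2))   # 0.000001, floored at the small constant c
```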
Unlike the cells following strategy 1, the cells following strategy 2 quickly hone in on the goal and remain near it.Three sample trajectories for the standard random walk strategy (left) and chemotactic random walk strategy (right). The standard random walk strategy is shown on the left, and the chemotactic random walk is shown on the right. Redder regions correspond to higher concentrations of ligand, with a goal having maximum concentration at the point (1500, 1500), which is indicated with a blue square. Each particle’s walk is colored from darker to lighter colors across the time frame of its trajectory.We should be wary of such a small sample size. To confirm that what we observed in these trajectories is true in general, we will compare the two strategies over many simulations. The following figure visualizes the particle’s average distance to the goal over 500 simulations for both strategies and confirms our previous observation that strategy 2 is effective at guiding the simulated particle to the goal. And yet the direction of travel in this strategy is random, so why would this strategy be so successful?Distance to the goal plotted over time for 500 simulated particles following the standard random walk strategy (pink) and the chemotactic random walk strategy (green). The dark lines indicate the average distance over all simulations, and the shaded area around each line represents one standard deviation from the average.The chemotactic strategy works because of a “rubber band” effect. If the bacterium is traveling down an attractant gradient (i.e., away from an attractant), then it is not allowed to travel very far in a single step before it is forced to tumble. If an increase of attractant is detected, however, then the cell can travel farther in a single direction before tumbling. On average, this effect serves to pull the bacterium in the direction of increasing attractant, even though the directions in which it travels are random.A tiny change to a simple, unsuccessful randomized algorithm can therefore produce an elegant approach for exploring an unknown environment. But we left one more question unanswered: why is the default frequency of one tumble per second stable across a wide range of bacteria? To address this question, we will see how changing t0, the default time for a run step in the absence of change in attractant concentration, affects the ability of a simulated bacterium following strategy 2 to reach the goal. You may like to adjust the value of t0 in the previous tutorial yourself, or follow the tutorial below.Visit tutorialWhy is background tumbling frequency constant across bacterial species?The following figures show three trajectories for a few different values of t0 and a simulation that lasts for 800 seconds. 
First, we set t0 equal to 0.2 seconds and see that the simulated bacteria are not able to walk far enough in a single step to head toward the goal.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 0.2 seconds.If we increase t0 to 5.0 seconds, then cells can run for so long that they may run past the goal without being able to brake by tumbling.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 5.0 seconds.When we set t0 equal to 1.0, then the figure below shows a “Goldilocks” effect in which the simulated bacterium can run for long enough at a time to head quickly toward the goal, and it tumbles frequently enough to keep it near the goal.Three sample trajectories of a simulated cell following the chemotactic random walk strategy with an average run time between tumbles t0 of 1.0 seconds.The figure below visualizes average particle distance to the goal over time for 500 particles using a variety of choices of t0. It confirms that tumbling every second by default is “just right” for finding an attractant.Average distance to the goal over time for 500 cells. Each colored line indicates the average distance to the goal over time for a different value of t0; the shaded area represents one standard deviation.Below, we reproduce the video from earlier in this module showing E. coli moving towards a sugar crystal. This video shows that the behavior of real E. coli is similar to that of our simulated bacteria. Bacteria generally move towards the crystal and then remain close to it; some bacteria run by the crystal, but they turn around to move toward the crystal again. Bacteria are even smarter than we thoughtIf you closely examine the video above, then you may be curious about the way that bacteria turn around and head back toward the attractant. When they reorient, their behavior appears more intelligent than simply walking in a random direction. As is often true in biology, the reality of the system that we are studying turns out to be more complex than we might at first imagine.The direction of bacterial reorientation is not completely random, but rather follows a normal distribution with mean of 68° and standard deviation of 36°2. That is, the bacterium typically does not tend to make as drastic of a change to its orientation as it would in a pure random walk, which