8.1E: Exercises - Mathematics


Practice Makes Perfect

In the following exercises, determine the values for which the rational expression is undefined.

Example 49

  1. \(\frac{2x}{z}\)
  2. \(\frac{4p-1}{6p-5}\)
  3. \(\frac{n-3}{n^2+2n-8}\)
Answer
  1. \(z=0\)
  2. \(p=\frac{5}{6}\)
  3. \(n=-4,\ n=2\)

Example 50

  1. \(\frac{10m}{11n}\)
  2. \(\frac{6y+13}{4y-9}\)
  3. \(\frac{b-8}{b^2-36}\)

Example 51

  1. \(\frac{4x^{2}y}{3y}\)
  2. \(\frac{3x-2}{2x+1}\)
  3. \(\frac{u-1}{u^2-3u-28}\)
Answer
  1. \(y=0\)
  2. \(x=-\frac{1}{2}\)
  3. \(u=-4,\ u=7\)

Example 52

  1. \(\frac{5pq^{2}}{9q}\)
  2. \(\frac{7a-4}{3a+5}\)
  3. \(\frac{1}{x^2-4}\)

Evaluate Rational Expressions

In the following exercises, evaluate the rational expression for the given values.
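
One way to check answers in this section is to evaluate an expression with exact fractions. The sketch below is a minimal illustration for 2x/(x−1); the helper name `evaluate_rational` is hypothetical, and a value that makes the denominator zero is reported as `None` (undefined).

```python
from fractions import Fraction

def evaluate_rational(numerator, denominator, x):
    """Evaluate numerator(x) / denominator(x) exactly; return None if undefined."""
    den = denominator(x)
    if den == 0:
        return None  # the rational expression is undefined at this value
    return Fraction(numerator(x), den)

# Expression 2x / (x - 1) at x = 0, 2, -1, and at x = 1 where it is undefined.
for x in (0, 2, -1, 1):
    print(x, evaluate_rational(lambda v: 2 * v, lambda v: v - 1, x))
```

Using `Fraction` rather than floating-point division keeps answers such as 5/2 or −1/5 in the same form the exercises expect.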

Example 53

\(\frac{2x}{x-1}\)

  1. \(x=0\)
  2. \(x=2\)
  3. \(x=-1\)
Answer
  1. 0
  2. 4
  3. 1

Example 54

\(\frac{4y-1}{5y-3}\)

  1. \(y=0\)
  2. \(y=2\)
  3. \(y=-1\)

Example 55

\(\frac{2p+3}{p^2+1}\)

  1. \(p=0\)
  2. \(p=1\)
  3. \(p=-2\)
Answer
  1. 3
  2. \(\frac{5}{2}\)
  3. \(-\frac{1}{5}\)

Example 56

\(\frac{x+3}{2-3x}\)

  1. \(x=0\)
  2. \(x=1\)
  3. \(x=-2\)

Example 57

\(\frac{y^2+5y+6}{y^2-1}\)

  1. \(y=0\)
  2. \(y=2\)
  3. \(y=-2\)
Answer
  1. \(-6\)
  2. \(\frac{20}{3}\)
  3. 0

Example 58

\(\frac{z^2+3z-10}{z^2-1}\)

  1. \(z=0\)
  2. \(z=2\)
  3. \(z=-2\)

Example 59

\(\frac{a^2-4}{a^2+5a+4}\)

  1. \(a=0\)
  2. \(a=1\)
  3. \(a=-2\)
Answer
  1. \(-1\)
  2. \(-\frac{3}{10}\)
  3. 0

Example 60

\(\frac{b^2+2}{b^2-3b-4}\)

  1. \(b=0\)
  2. \(b=2\)
  3. \(b=-2\)

Example 61

\(\frac{x^2+3xy+2y^2}{2x^{3}y}\)

  1. \(x=1,\ y=-1\)
  2. \(x=2,\ y=1\)
  3. \(x=-1,\ y=-2\)
Answer
  1. 0
  2. \(\frac{3}{4}\)
  3. \(\frac{15}{4}\)

Example 62

\(\frac{c^2+cd-2d^2}{cd^{3}}\)

  1. \(c=2,\ d=-1\)
  2. \(c=1,\ d=-1\)
  3. \(c=-1,\ d=2\)

Example 63

\(\frac{m^2-4n^2}{5mn^3}\)

  1. \(m=2,\ n=1\)
  2. \(m=-1,\ n=-1\)
  3. \(m=3,\ n=2\)
Answer
  1. 0
  2. \(-\frac{3}{5}\)
  3. \(-\frac{7}{120}\)

Example 64

\(\frac{2s^{2}t}{s^2-9t^2}\)

  1. \(s=4,\ t=1\)
  2. \(s=-1,\ t=-1\)
  3. \(s=0,\ t=2\)

Simplify Rational Expressions

In the following exercises, simplify.

Example 65

\(-\frac{4}{52}\)

Answer

\(-\frac{1}{13}\)

Example 66

\(-\frac{44}{55}\)

Example 67

\(\frac{56}{63}\)

Answer

\(\frac{8}{9}\)

Example 68

\(\frac{65}{104}\)

Example 69

\(\frac{6ab^{2}}{12a^{2}b}\)

Answer

\(\frac{b}{2a}\)

Example 70

\(\frac{15xy^{3}}{x^{3}y^{3}}\)

Example 71

\(\frac{8m^{3}n}{12mn^2}\)

Answer

\(\frac{2m^2}{3n}\)

Example 72

\(\frac{36v^{3}w^2}{27vw^3}\)

Example 73

\(\frac{3a+6}{4a+8}\)

Answer

\(\frac{3}{4}\)

Example 74

\(\frac{5b+5}{6b+6}\)

Example 75

\(\frac{3c-9}{5c-15}\)

Answer

\(\frac{3}{5}\)

Example 76

\(\frac{4d+8}{9d+18}\)

Example 77

\(\frac{7m+63}{5m+45}\)

Answer

\(\frac{7}{5}\)

Example 78

\(\frac{8n-96}{3n-36}\)

Example 79

\(\frac{12p-240}{5p-100}\)

Answer

\(\frac{12}{5}\)

Example 80

\(\frac{6q+210}{5q+175}\)

Example 81

\(\frac{a^2-a-12}{a^2-8a+16}\)

Answer

\(\frac{a+3}{a-4}\)

Example 82

\(\frac{x^2+4x-5}{x^2-2x+1}\)

Example 83

\(\frac{y^2+3y-4}{y^2-6y+5}\)

Answer

\(\frac{y+4}{y-5}\)

Example 84

\(\frac{v^2+8v+15}{v^2-v-12}\)

Example 85

\(\frac{x^2-25}{x^2+2x-15}\)

Answer

\(\frac{x-5}{x-3}\)

Example 86

\(\frac{a^2-4}{a^2+6a-16}\)

Example 87

\(\frac{y^2-2y-3}{y^2-9}\)

Answer

\(\frac{y+1}{y+3}\)

Example 88

\(\frac{b^2+9b+18}{b^2-36}\)

Example 89

\(\frac{y^3+y^2+y+1}{y^2+2y+1}\)

Answer

\(\frac{y^2+1}{y+1}\)

Example 90

\(\frac{p^3+3p^2+4p+12}{p^2+p-6}\)

Example 91

\(\frac{x^3-2x^2-25x+50}{x^2-25}\)

Answer

\(x-2\)

Example 92

\(\frac{q^3+3q^2-4q-12}{q^2-4}\)

Example 93

\(\frac{3a^2+15a}{6a^2+6a-36}\)

Answer

\(\frac{a(a+5)}{2(a+3)(a-2)}\)

Example 94

\(\frac{8b^2-32b}{2b^2-6b-80}\)

Example 95

\(\frac{-5c^2-10c}{-10c^2+30c+100}\)

Answer

\(\frac{c}{2(c-5)}\)

Example 96

\(\frac{4d^2-24d}{2d^2-4d-48}\)

Example 97

\(\frac{3m^2+30m+75}{4m^2-100}\)

Answer

\(\frac{3(m+5)}{4(m-5)}\)

Example 98

\(\frac{5n^2+30n+45}{2n^2-18}\)

Example 99

\(\frac{5r^2+30r-35}{r^2-49}\)

Answer

\(\frac{5(r-1)}{r-7}\)

Example 100

\(\frac{3s^2+30s+72}{3s^2-48}\)

Example 101

\(\frac{t^3-27}{t^2-9}\)

Answer

\(\frac{t^2+3t+9}{t+3}\)

Example 102

\(\frac{v^3-1}{v^2-1}\)

Example 103

\(\frac{w^3+216}{w^2-36}\)

Answer

\(\frac{w^2-6w+36}{w-6}\)

Example 104

\(\frac{v^3+125}{v^2-25}\)

Simplify Rational Expressions with Opposite Factors

In the following exercises, simplify each rational expression.

Example 105

\(\frac{a-5}{5-a}\)

Answer

\(-1\)

Example 106

\(\frac{b-12}{12-b}\)

Example 107

\(\frac{11-c}{c-11}\)

Answer

\(-1\)

Example 108

\(\frac{5-d}{d-5}\)

Example 109

\(\frac{12-2x}{x^2-36}\)

Answer

\(-\frac{2}{x+6}\)

Example 110

\(\frac{20-5y}{y^2-16}\)

Example 111

\(\frac{4v-32}{64-v^2}\)

Answer

\(-\frac{4}{8+v}\)

Example 112

\(\frac{7w-21}{9-w^2}\)

Example 113

\(\frac{y^2-11y+24}{9-y^2}\)

Answer

\(-\frac{y-8}{3+y}\)

Example 114

\(\frac{z^2-9z+20}{16-z^2}\)

Example 115

\(\frac{a^2-5a-36}{81-a^2}\)

Answer

\(-\frac{a+4}{9+a}\)

Example 116

\(\frac{b^2+b-42}{36-b^2}\)

Everyday Math

Example 117

Tax Rates For the tax year 2015, the amount of tax owed by a single person earning between $37,450 and $90,750 can be found by evaluating the formula \(0.25x-4206.25\), where \(x\) is income. The average tax rate for this income can be found by evaluating the formula \(\frac{0.25x-4206.25}{x}\). What would be the average tax rate for a single person earning $50,000?

Answer

About 16.6% (the formula gives \(\frac{8293.75}{50000} = 0.165875\))

Example 118

Work The length of time it takes for two people to perform the same task if they work together can be found by evaluating the formula \(\frac{xy}{x+y}\). If Tom can paint the den in \(x=45\) minutes and his brother Bobby can paint it in \(y=60\) minutes, how many minutes will it take them if they work together?

Writing Exercises

Example 119

Explain how you find the values of \(x\) for which the rational expression \(\frac{x^2-x-20}{x^2-4}\) is undefined.

Answer

Answers will vary, but all should reference setting the denominator function to zero.

Example 120

Explain all the steps you take to simplify the rational expression \(\frac{p^2+4p-21}{9-p^2}\).

Self Check

ⓐ After completing the exercises, use this checklist to evaluate your mastery of the objectives of this section.

ⓑ If most of your checks were:

…confidently. Congratulations! You have achieved your goals in this section! Reflect on the study skills you used so that you can continue to use them. What did you do to become confident of your ability to do these things? Be specific!

…with some help. This must be addressed quickly as topics you do not master become potholes in your road to success. Math is sequential - every topic builds upon previous work. It is important to make sure you have a strong foundation before you move on. Who can you ask for help? Your fellow classmates and instructor are good resources. Is there a place on campus where math tutors are available? Can your study skills be improved?

…no - I don’t get it! This is critical and you must not ignore it. You need to get help immediately or you will quickly be overwhelmed. See your instructor as soon as possible to discuss your situation. Together you can come up with a plan to get you the help you need.



6.2 Exercises

You can download a template RMarkdown file to start from here.

We’ll explore LASSO modeling using the Hitters dataset in the ISLR package (associated with the optional textbook). You’ll need to install the ISLR package in the Console first. You should also install the glmnet package as we’ll be using it subsequently for fitting LASSO models.

The Hitters dataset contains a number of stats on major league baseball players in 1987. Our goal will be to build a regression model that predicts player Salary.

  1. Get to know the Hitters data
    1. Peek at the first few rows.
    2. How many players are in the dataset?
    3. How many possible predictors of salary are there?

    Developing some intuition
    A natural model to start with is one with all possible predictors. The following model is fit with ordinary (not penalized) least squares:

    1. Use caret to perform 7-fold cross-validation to estimate the test error of this model. Use the straight average of the RMSE column instead of squaring the values first. (Why 7? Think about the number of cases in the folds.)
    2. How do you think the estimated test error would change with fewer predictors?
    3. Briefly describe how the output of a stepwise selection procedure could help you choose a smaller model (a model with fewer predictors).
    4. This model fit with ordinary least squares corresponds to a special case of penalized least squares. What is the value of \(\lambda\) in this special case?
    5. As \(\lambda\) increases, what would you expect to happen to the number of predictors that remain in the model?

    The code below fits a LASSO model with \(\lambda = 10\). This value of \(\lambda\) is specified in the tuneGrid argument. Setting alpha = 1 specifies the LASSO method specifically (the glmnet method can fit other penalized models as well).

    Fit the LASSO using \(\lambda = 100\).

    How many variables remain in the LASSO model with \(\lambda = 100\)? Is this model “bigger” or smaller than the LASSO with \(\lambda = 10\)? How do the variables’ coefficients compare to the corresponding variables in the least squares model and the LASSO with \(\lambda = 10\)?

    LASSO for a variety of \(\lambda\)
    There are infinitely many \(\lambda\) we could use. It would be too tedious to examine these one at a time. The following code fits LASSO models across a grid of \(\lambda\) values and makes a summary plot of the coefficient estimates as a function of \(\lambda\).

    • Each colored line corresponds to a different predictor. The small number to the left of each line indicates a predictor by its position in rownames(lasso_mod$finalModel$beta).
    • The x-axis reflects the range of different \(\lambda\) values considered in lasso_mod (the lambdas vector that we created).
    • At each \(\lambda\), the y-axis reflects the coefficient estimates for the predictors in the corresponding LASSO model.
    • At each \(\lambda\), the numbers at the top of the plot indicate how many predictors remain in the corresponding model.

    We can zoom in on the plot by setting the y-axis limits to go from -10 to 10 with ylim as below. Compare the lines for variables 6 and 15. What are variables 6 and 15? Which seems to be a more “important” or “persistent” variable? Does this make sense in context?

    Picking \(\lambda\)
    In order to pick which \(\lambda\) (hence LASSO model) is “best”, we can compare the 7-fold CV error rate for each model. caret has actually done that for us when it train()ed the model. We can look at a plot of those results:

    1. Comment on the shape of the plot. The RMSEs go down at the very beginning then start going back up. Why do you think that is?
    2. Roughly, what value of \(\lambda\) results in the best model?

    This plot indicates that we tried many \(\lambda\) values that were pretty bad. (Why?) Let’s fit LASSO models over a better grid of \(\lambda\) values. Modify the previous code to use the following grid and remake lasso_mod and the previous plot:

    Picking \(\lambda\): accounting for uncertainty
    Each of the points on the previous plot arose from taking the mean RMSE over 7 cross-validation iterations. Those 7 RMSE estimates have a standard deviation and standard error too. You can use the custom best_lambdas() function to make a plot of estimated test RMSE versus \(\lambda\) that also shows information about the standard errors.
    In particular, the plot shows points that exactly correspond to the previous plot. The additional lines show 1 standard error above and below the RMSE estimate. In essence, the span of the lines indicates a confidence interval.
    The best_lambdas() function also prints information about some reasonable choices for good \(\lambda\) values.

    1. The first row of printed output shows a choice for \(\lambda\) called lambda_min, the \(\lambda\) at which the observed CV error was smallest. The second row shows a choice called lambda_1se, the largest \(\lambda\) for which the corresponding LASSO model has a CV error that’s still within 1 standard error of that for the LASSO using lambda_min. Explain why we might use the LASSO with lambda_1se instead of lambda_min.
    2. How does the CV-estimated RMSE of these models compare to that of the original ordinary least squares model in exercise 2?
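
The two choices described above, lambda_min and lambda_1se, can be sketched in a few lines. This is only an illustration in Python of the selection rule as worded in the exercise, not the course's best_lambdas() function; the function name `pick_lambdas` and the numbers in `results` are made up.

```python
def pick_lambdas(cv_results):
    """cv_results: list of (lam, mean_rmse, se_rmse) tuples, one per candidate lambda.

    Returns (lambda_min, lambda_1se) following the one-standard-error rule.
    """
    # lambda_min: the candidate with the smallest mean cross-validated RMSE
    lam_min, best_rmse, best_se = min(cv_results, key=lambda row: row[1])
    # lambda_1se: the largest lambda whose mean RMSE is within one SE of the best
    threshold = best_rmse + best_se
    lam_1se = max(lam for lam, rmse, _ in cv_results if rmse <= threshold)
    return lam_min, lam_1se

# Made-up numbers for illustration only (not results from the Hitters data).
results = [
    (1, 340.0, 12.0),
    (10, 335.0, 11.0),
    (50, 338.0, 11.5),
    (100, 344.0, 12.5),
    (200, 360.0, 13.0),
]
print(pick_lambdas(results))  # (10, 100)
```

Here the smallest mean RMSE is at 10, and 100 is the largest candidate whose RMSE (344) is still below 335 + 11 = 346.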

    Look at the coefficients of LASSO models corresponding to both choices of \(\lambda\). How do the coefficients differ between lambda_min and lambda_1se? Does one model’s coefficients seem more sensible contextually? The instructor does not have a deep enough understanding of baseball, but you might!


    Programming Praxis

    We saw in the previous exercise that finding an exact solution for the traveling salesman problem is extremely time consuming, taking time O(n!). The alternative is a heuristic that delivers a reasonably good solution quickly. One such heuristic is the “nearest neighbor:” pick a starting point, then at each step pick the nearest unvisited point, add it to the current tour and mark it visited, repeating until there are no unvisited points.
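
The heuristic as described can be sketched as follows. This is one possible reading, not the site's suggested solution: the function names are hypothetical, points are assumed to be coordinate tuples, at least one point is assumed, and ties are broken by the lower index.

```python
import math

def nearest_neighbor_tour(points, start=0):
    """Return a tour (list of indices) built with the nearest neighbor heuristic."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        current = points[tour[-1]]
        # Pick the closest unvisited point; ties are broken by the lower index.
        nearest = min(unvisited, key=lambda i: (math.dist(current, points[i]), i))
        unvisited.remove(nearest)
        tour.append(nearest)
    return tour

def tour_length(points, tour):
    """Length of the closed tour, including the leg back to the start."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

points = [(0, 0), (5, 0), (1, 0), (3, 0)]
tour = nearest_neighbor_tour(points)
print(tour, tour_length(points, tour))  # [0, 2, 3, 1] 10.0
```

Each step scans all unvisited points, so the whole tour takes O(n²) distance computations, compared with O(n!) for the exact search.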

    Your task is to write a program that solves the traveling salesman problem using the nearest neighbor heuristic. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

    6 Responses to “Traveling Salesman: Nearest Neighbor”

    A solution in Python, including comparison with brute force. Comments on my blog on http://wrongsideofmemphis.wordpress.com/2010/03/16/travelling-salesman/

    import random
    import itertools
    import operator
    import datetime

    # Assumed map bounds; the constants are not shown in the original comment.
    MAX_X = 100
    MAX_Y = 100

    def random_cities(number):
        '''Generate a number of cities located at random places.'''
        cities = [(random.randrange(0, MAX_X),
                   random.randrange(0, MAX_Y))
                  for i in range(number)]
        return cities

    def path_length(path):
        '''Get the length of a path.'''
        length = 0
        for i in range(len(path) - 1):
            # Add the distance between two consecutive cities
            length += abs(complex(path[i][0], path[i][1])
                          - complex(path[i + 1][0], path[i + 1][1]))
        return length

    def find_path_bruteforce(cities):
        '''Find the shortest path using brute force.'''
        lengths = []
        for path in itertools.permutations(cities, len(cities)):
            # Get the length of the path, adding the returning point
            total_path = path + (path[0],)
            length = path_length(total_path)
            lengths.append((total_path, length))

        # Get minimum
        lengths.sort(key=operator.itemgetter(1))
        return lengths[0]

    def find_path_nearest(cities):
        '''Find a path with the nearest neighbor heuristic, trying every start city.'''
        lengths = []
        for city in cities:
            actual_cities = cities[:]
            actual_city = actual_cities.pop(actual_cities.index(city))
            path = [actual_city]
            # Find nearest neighbor of the current city (not of the start city)
            while actual_cities:
                min_length = []
                for next_city in actual_cities:
                    min_length.append((next_city,
                                       abs(complex(actual_city[0], actual_city[1])
                                           - complex(next_city[0], next_city[1]))))
                # Get closest neighbor
                min_length.sort(key=operator.itemgetter(1))
                actual_city = min_length[0][0]
                actual_cities.pop(actual_cities.index(actual_city))
                path.append(actual_city)

            # Complete the trip with the first city and record the tour
            path.append(city)
            lengths.append((tuple(path), path_length(path)))

        # Get minimum
        lengths.sort(key=operator.itemgetter(1))
        return lengths[0]

    if __name__ == '__main__':
        for i in range(3, 10):
            print('Number of cities: ', i)
            cities = random_cities(i)


    Here you can ask our community for a resource to be created that meets your specifications.

    June 11, 2021 Kylie from Australia (AUS) - All States,Australia (AUS) - New asked:

    Name Angles (geometry) Year 5 NSW Syllabus 5 sequenced lessons

    Details I need a sequence of five 45-minute lessons for a Year 5 student. I work in an SSP and mainly program for high school students. I need it to match the NSW syllabus.

    Resource For: For Student,For Educator

    Location Australia (AUS) - All States,Australia (AUS) - New

    May 16, 2020 Melanie from Australia (AUS) - Western Australia (WA),Wa asked:

    Name K-3 WA curriculum Health work packs

    Details I am looking for packs aligned with the WA curriculum that cover the entirety of the content listed in the health curriculum for K-3. They will be sent out for distance learning.

    Location Australia (AUS) - Western Australia (WA),Wa

    Topic Health,Health and Fitness,Health Education,Health Priorities in Australia,Healthy eating,Healthy Lifestyle,Healthy Living,Curriculum

    Grade Level Pre K,Kindergarten,Year 1,Year 2,Year 3

    Outcomes 1.0,Cover all of the learning area

    December 18, 2017 Kibria from New South Wales asked:

    Name NSW School-Year 3 to Year 6 Reading Mathematics General Ability resources & test

    Details We are looking for high-quality customised resources for the NSW education curriculum, Year 3 to Year 6, for 1. Reading 2. Mathematics 3. General Ability resources & test

    Resource For: For Educator

    Grade Level Year 3,Year 4,Year 5,Year 6

    December 13, 2017 Kibria from New South Wales asked:

    Name NSW School-Year 3 to Year 6 Reading Mathematics General Ability resources & test

    Details We are looking for high-quality customised resources for the NSW education curriculum, Year 3 to Year 6, for 1. Reading 2. Mathematics 3. General Ability resources & test


    NumPy Logic functions: isclose() function

    The isclose() function returns a boolean array indicating where two arrays are element-wise equal within a tolerance.

    The tolerance values are positive, typically very small numbers.
    The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.

    Version: 1.15.0

    Name | Description | Data types | Required / Optional
    a, b | Input arrays to compare. | array_like | Required
    rtol | The relative tolerance parameter (default 1e-05). | float | Optional
    atol | The absolute tolerance parameter (default 1e-08). | float | Optional
    equal_nan | Whether to compare NaN’s as equal. If True, NaN’s in a will be considered equal to NaN’s in b in the output array (default False). | bool | Optional

    Returns:
    y : array_like - Returns a boolean array of where a and b are equal within the given tolerance.
    If both a and b are scalars, returns a single boolean value.

    Notes:
    For finite values, isclose uses the following equation to test whether two floating point values are equivalent:

    absolute(a - b) <= (atol + rtol * absolute(b))
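
To make the tolerance test concrete, here is a minimal scalar sketch in plain Python. It is not NumPy's implementation: the name `isclose_scalar` is hypothetical, it handles single floats only, and it combines atol and rtol * abs(b) as described above, using NumPy's documented defaults.

```python
import math

def isclose_scalar(a, b, rtol=1e-05, atol=1e-08, equal_nan=False):
    """Scalar sketch of the comparison described above (not NumPy's implementation)."""
    if math.isnan(a) or math.isnan(b):
        # NaN compares equal only to NaN, and only when equal_nan is True
        return equal_nan and math.isnan(a) and math.isnan(b)
    if math.isinf(a) or math.isinf(b):
        # Infinities are close only if they are the same infinity
        return a == b
    return abs(a - b) <= atol + rtol * abs(b)

print(isclose_scalar(1.0, 1.0 + 1e-9))                             # True
print(isclose_scalar(1.0, 1.1))                                    # False
print(isclose_scalar(float("nan"), float("nan"), equal_nan=True))  # True
```

Because only abs(b) scales the relative tolerance, the test is not symmetric in a and b; b acts as the reference value.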


    Learntofish's Blog

    Posted by Ed on September 20, 2009

    In the following I will explain what Object Oriented Programming means in Java. And you will see that it is easy to understand once you’ve grasped the notion of an object. I encourage you to copy the code and run it in Java.

    What is an object?
    Definition: An object in Object Oriented Programming is characterized by
    a) its attributes (properties)
    b) methods that do something with the attributes (for example change them)

    For example Homer Simpson is an object:
    a) His attributes are: name, age, gender, etc.
    b) And there are methods that you can apply to the object:
    e.g. you can ask for Homer Simpson’s name or you can let him have his birthday that changes his age.

    What is a class?
    There are different kinds of objects, for example:
    – Marie Curie, Isaac Newton and Emmy Noether have something in common. They are persons. We say that they are objects of the class “Person”.
    – Apple, banana, mango: These are objects of the class “Fruit”.
    – Jacket, blouse, pullover, pants: These are objects of the class “clothes”.

    Let’s get practical and type the following code in Java (to copy the code: go to the source code window below and point to the top right corner. There, click on the white symbol):

    Let’s examine the program:
    – In line 01 we declare a class “Person”.
    – In line 03-05 we define what kind of attributes (properties) our objects have. Here, our objects shall have a name and an age.
    – In line 07-27 we define the methods that can be applied to our objects. For example, in line 08-10 the method setName() gives our objects a name.
    – Our main() method (or program) starts in line 29.

    Compiling the code above does not yield any console output since we haven’t written anything in our main() method yet. In the following we will add code to the main() method.

    Creating an object and setting its attributes

    Let’s examine the code:
    – Here, in line 04 we create an object “father” from the class “Person” by typing
    Person father = new Person()
    In general, we create an object by typing
    Class object = new Class()
    Besides, the method Person() is called a “constructor”. The constructor has its name for obvious reasons: we construct an object.

    – In line 07-08 we apply methods to our object “father”. We give him a name and an age:
    father.setName(“Homer”)
    father.setAge(40)

    In general applying a method to an object is done by typing:
    object.method()
    In particular, you have noticed the dot. Whenever you want to apply a method to an object you use the dot.

    Show the attributes on the console output
    Compiling the code above still won’t give us any output on the console. Let’s add some code to the main() method (line 10-12):

    Now, we want the father to tell us his name and age. That is why in line 11-12 we apply the methods getName() and getAge() to our object father. Compiling the code yields the console output:
    My name is Homer.
    I am 40 years old.

    Exercises:
    1a) Using the “Person” class from above create an object “daughter”.
    1b) Give her the name “Lisa Simpson” who is 8 years old.
    1c) “daughter” shall tell us her name and age via the console output.
    1d) Using the “Person” class from above create an object “son” with the attributes name=”Bart Simpson” and age=8.
    1e) Make the son tell us his name. Afterwards, apply the method birthday() to son.

    2a) Create a class “Fruit” with a main() method.
    2b) The objects of this class should have the following attribute: color
    The class has the methods setColor() and getColor().
    2c) Create an object apple. Set its color to “red”. Then show the color on the console output by using the getColor() method.

    The console output for 1c) is: My name is Lisa Simpson. I am 8 years old.

    For 1d) and 1e) the console output is:
    My name is Bart Simpson.
    I’ve just had my birthday! Now, I am 9 years old.
    2a) and 2b)

    For 2c) add the following code to the main() method:

    Here, we get the console output: My color is: red.



    16.7.1. Model Architectures

    In a sequence-aware recommendation system, each user is associated with a sequence of items from the item set. Let \(S^u = (S_1^u, \ldots, S_{|S_u|}^u)\) denote the ordered sequence. The goal of Caser is to recommend items by considering users' general tastes as well as their short-term intentions. Suppose we take the previous \(L\) items into consideration; an embedding matrix that represents the former interactions for time step \(t\) can then be constructed:

    (16.7.1) \[\mathbf{E}^{(u, t)} = [\mathbf{q}_{S_{t-L}^u}, \ldots, \mathbf{q}_{S_{t-2}^u}, \mathbf{q}_{S_{t-1}^u}]^\top,\]

    where \(\mathbf{Q} \in \mathbb{R}^{n \times k}\) represents item embeddings and \(\mathbf{q}_i\) denotes the \(i^\mathrm{th}\) row. \(\mathbf{E}^{(u, t)} \in \mathbb{R}^{L \times k}\) can be used to infer the transient interest of user \(u\) at time step \(t\). We can view the input matrix \(\mathbf{E}^{(u, t)}\) as an image which is the input of the subsequent two convolutional components.

    The horizontal convolutional layer has \(d\) horizontal filters \(\mathbf{F}^j \in \mathbb{R}^{h \times k}, 1 \leq j \leq d, h = \{1, \ldots, L\}\), and the vertical convolutional layer has \(d'\) vertical filters \(\mathbf{G}^j \in \mathbb{R}^{L \times 1}, 1 \leq j \leq d'\). After a series of convolutional and pooling operations, we get the two outputs:

    (16.7.2) \[\mathbf{o} = \text{HConv}(\mathbf{E}^{(u, t)}, \mathbf{F}), \quad \mathbf{o}' = \text{VConv}(\mathbf{E}^{(u, t)}, \mathbf{G}),\]

    where \(\mathbf{o} \in \mathbb{R}^d\) is the output of the horizontal convolutional network and \(\mathbf{o}' \in \mathbb{R}^{kd'}\) is the output of the vertical convolutional network. For simplicity, we omit the details of the convolution and pooling operations. The two outputs are concatenated and fed into a fully connected neural network layer to get more high-level representations:

    (16.7.3) \[\mathbf{z} = \phi(\mathbf{W}[\mathbf{o}, \mathbf{o}']^\top + \mathbf{b}),\]

    where \(\mathbf{W} \in \mathbb{R}^{k \times (d + kd')}\) is the weight matrix and \(\mathbf{b} \in \mathbb{R}^k\) is the bias. The learned vector \(\mathbf{z} \in \mathbb{R}^k\) is the representation of the user’s short-term intent.

    Finally, the prediction function combines users’ short-term and general tastes, and is defined as:

    (16.7.4) \[\hat{y}_{uit} = \mathbf{v}_i \cdot [\mathbf{z}, \mathbf{p}_u]^\top + \mathbf{b}'_i,\]

    where \(\mathbf{V} \in \mathbb{R}^{n \times 2k}\) is another item embedding matrix, \(\mathbf{b}' \in \mathbb{R}^n\) is the item-specific bias, and \(\mathbf{P} \in \mathbb{R}^{m \times k}\) is the user embedding matrix for users’ general tastes. \(\mathbf{p}_u \in \mathbb{R}^{k}\) is the \(u^\mathrm{th}\) row of \(\mathbf{P}\) and \(\mathbf{v}_i \in \mathbb{R}^{2k}\) is the \(i^\mathrm{th}\) row of \(\mathbf{V}\).
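
The prediction function (16.7.4) is a dot product between an item embedding and the concatenation of two user-side vectors, plus an item bias. The sketch below illustrates only that final step with plain Python lists. The names `z`, `p_u`, `v_i` and `b_i` are labels chosen here for the short-term intent vector, the user embedding, the item embedding and the item bias, and the numbers are made up.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

def caser_score(z, p_u, v_i, b_i):
    """Score for one (user, item) pair: v_i . [z, p_u] + b_i."""
    features = z + p_u  # concatenate short-term intent and general taste (length 2k)
    assert len(features) == len(v_i), "v_i must have length 2k"
    return dot(v_i, features) + b_i

# Toy values with k = 2 (made up for illustration).
z = [0.5, -1.0]              # short-term intent
p_u = [1.0, 2.0]             # user embedding (general taste)
v_i = [2.0, 0.0, 1.0, 0.5]   # item embedding
b_i = 0.25                   # item-specific bias
print(caser_score(z, p_u, v_i, b_i))  # 3.25
```

In the full model these vectors come from learned embedding tables and the convolutional layers; the sketch skips all of that.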

    The model can be learned with BPR or Hinge loss. The architecture of Caser is shown below:

    Fig. 16.7.1 Illustration of the Caser Model

    We first import the required libraries.


    Analysis of Introgression with SNP Data

    A tutorial on the analysis of hybridization and introgression with SNP data by Milan Malinsky and Michael Matschiner

    Admixture between populations and hybridisation between species are common and a bifurcating tree is often insufficient to capture their evolutionary history (example). Patterson’s D, also known as ABBA-BABA, and the related estimate of admixture fraction f, referred to as the f4-ratio, are commonly used to assess evidence of gene flow between populations or closely related species in genomic datasets. They are based on examining patterns of allele sharing across populations or closely related species. Although they were developed within a population genetic framework, the methods can be successfully applied for learning about hybridisation and introgression within groups of closely related species, as long as common population genetic assumptions hold, namely that (a) the species share a substantial amount of genetic variation due to common ancestry and incomplete lineage sorting, (b) recurrent and back mutations at the same sites are negligible, and (c) substitution rates are uniform across species.
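
To make the idea of allele-sharing patterns concrete, the sketch below computes the count-based form of Patterson's D, D = (nABBA − nBABA) / (nABBA + nBABA), for one sampled allele per population. It is a simplified illustration, not how Dsuite computes the statistic: the function name `patterson_d`, the string encoding of sites ('A' ancestral, 'B' derived, ordered P1, P2, P3, outgroup) and the example data are assumptions made here.

```python
def patterson_d(site_patterns):
    """Compute D = (ABBA - BABA) / (ABBA + BABA) from per-site allele patterns.

    Each pattern is a 4-character string over 'A' (ancestral) and 'B' (derived),
    ordered as (P1, P2, P3, outgroup).
    """
    n_abba = sum(1 for p in site_patterns if p == "ABBA")
    n_baba = sum(1 for p in site_patterns if p == "BABA")
    if n_abba + n_baba == 0:
        return None  # no informative sites
    return (n_abba - n_baba) / (n_abba + n_baba)

# Made-up patterns for illustration: 3 ABBA sites, 1 BABA site, 2 uninformative sites.
sites = ["ABBA", "BABA", "ABBA", "BBAA", "ABBA", "AAAB"]
print(patterson_d(sites))  # 0.5
```

Without gene flow, ABBA and BABA sites are expected to be equally frequent, so D should be close to zero; an excess of either pattern shifts it towards +1 or −1.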

    Patterson's D and related statistics have also been used to identify introgressed loci by sliding window scans along the genome, or by calculating these statistics for particular short genomic regions. Because the D statistic itself has large variance when applied to small genomic windows and because it is a poor estimator of the amount of introgression, additional statistics which are related to the f4-ratio have been designed specifically to investigate signatures of introgression in genomic windows along chromosomes. These statistics include fd (Martin et al., 2015), its extension fdM (Malinsky et al., 2015), and the distance fraction df (Pfeifer & Kapan, 2019).

    In this tutorial, we are first going to use simulated data to demonstrate that, under gene flow, some inferred species relationships might not correspond to any real biological relationships. Then we are going to use Dsuite, a software package that implements Patterson’s D and related statistics in a way that is straightforward to use and computationally efficient. This will allow us to identify admixed taxa. While exploring Dsuite, we are also going to learn or revise concepts related to application, calculation, and interpretation of the D and of related statistics. Next we apply sliding-window statistics to identify particular introgressed loci in a real dataset of Malawi cichlid fishes. Finally, we look at the same data that was used for species-tree inference with SVDQuartets in tutorial Species-Tree Inference with SNP Data to see if we can reach the same conclusions as a 2016 manuscript which used a more limited dataset with fewer species.

    Students who are interested can also apply ancestry painting to investigate a putative case of hybrid species.

    Dsuite: The Dsuite program allows the fast calculation of the D-statistic from SNP data in VCF format. The program is particularly useful because it automatically calculates the D-statistic for all possible species trios, optionally also in a way that the trios are compatible with a user-provided species tree. Instructions for download and installation on Mac OS X and Linux are provided on https://github.com/millanek/Dsuite. Installation on Windows is not supported, but Windows users can use the provided output files to learn how to plot and analyze the Dsuite output.

    FigTree: The program FigTree should already be installed if you followed the tutorials Bayesian Phylogenetic Inference, Phylogenetic Divergence-Time Estimation or other tutorials. If not, you can download it for Mac OS X, Linux, and Windows from https://github.com/rambaut/figtree/releases.

    pypopgen3: (with dependencies, including msprime) Pypopgen3 provides various useful population genetics tools, including, importantly, a wrapper for the msprime program to allow convenient simulations of phylogenomic data.

    1. Inferring the species-tree and gene-flow from a simulated dataset

    1.1 Simulating phylogenomic data with msprime

    One difficulty with applying and comparing different methods in evolutionary genomics and phylogenomics is that we rarely know what the right answer is. If methods give us conflicting answers, or any answers, how do we know if we can trust them? One approach that is often helpful is the use of simulated data. Knowing the truth allows us to see if the methods we are using make sense.

    One of the fastest software packages around for simulating phylogenomic data is the coalescent-based msprime (manuscript). The msprime manuscript and the software itself are presented in a population genetic framework. However, we can use it to produce phylogenomic data. This is because, from the point of view of looking purely at genetic data, there is no fundamental distinction between a set of allopatric populations of a single species and a set of different species. The genealogical processes that play out across different populations are indeed the same as the processes that determine the genetic relationships along the genome of any species that may arise. We will return to this theme of a continuum between population genetics and phylogenomics later.

    We have simulated SNP data for 20 species in the VCF format, two individuals from each species. The species started diverging 1 million years ago, with effective population sizes on each branch set to 50,000. Both the recombination and mutation rates were set to 1e-8 and 20Mb of data were simulated. Because these simulations take some time to run, we have made the simulated data available for you. The first is a simulation without gene-flow (VCF, true tree: image, newick, json), and the second is a simulation where five gene-flow events have been added to a tree (VCF, true tree with gene-flow: image, newick, json). Details for how to generate such simulated datasets are provided below.

    Generating simulated phylogenomic data with msprime

    It is in principle possible to simulate data from an arbitrary phylogeny with msprime, but specifying the phylogenetic tree directly in the program is complicated. Therefore, a number of 'helper' wrapper programs have been developed that can make this task much easier for us. For this exercise, we use Hannes Svardal's pypopgen3. After installing pypopgen3 and its dependencies, using the instructions on the webpage, we simulated the data using the following code:

    Note that if you run this code yourselves, the data and the trees that you get out will be somewhat different from the ones we prepared for you, because the trees are randomly generated and the coalescent simulation run by msprime is also a stochastic process. Therefore, every simulation will be different. If you want to simulate more data using the trees we provided to you, you would replace the treetools.get_random_geneflow_species_tree commands with the following code to read the provided trees:

    1.2 Reconstructing phylogenies from simulated data

    Now we apply the phylogenetic (or phylogenomic) approaches that we have learned to the simulated SNP data to see if we can recover the phylogenetic trees that were used as input to the simulations. As in the tutorial on Species-Tree Inference with SNP Data, we are going to use algorithms implemented in PAUP*.

    Our msprime simulation did not use any specific substitution model for mutations, but simply designated alleles as 0 for ancestral and 1 for derived. The alleles are indicated in the fourth (REF) and fifth (ALT) column of the VCF as per the VCF file format. To use PAUP* we first need to convert the VCF into the Nexus format, and this needs the 0 and 1 alleles to be replaced by actual DNA bases. We can use the vcf2phylip.py python script and achieve these steps as follows, first for the dataset simulated without gene-flow:

    Next, open the Nexus file chr1_no_geneflow_nt.min4.nexus in PAUP*, again making sure that the option "Execute" is set in the opening dialog, as shown in the screenshot.

    Then designate the outgroup (Data->Define_outgroup) as you learned in the tutorial on Species-Tree Inference with SNP Data.

    Then use the Neighbor Joining algorithm (Analysis->Neighbor-Joining/UPGMA) with default parameters to build a quick phylogeny.

    As you can see by comparison of the tree you just reconstructed (also below) against the input tree, a simple Neighbor Joining algorithm easily reconstructs the tree topology perfectly, and even the branch lengths are almost perfect.

    Click here to see the reconstructed NJ tree without gene-flow

    Now we repeat the same tree-reconstruction procedure for the simulation with gene-flow, starting with file format conversion:

    Then load the file with_geneflow_nt.min4.nexus into PAUP*, again making sure the option "Execute" is set, then designate the outgroup, and finally run the Neighbor Joining tree reconstruction. You should get a tree like the one below:

    An examination of this reconstructed tree reveals that in this case we did not recover the topology of the true tree used as input to the simulation. Unlike in the true tree, in the reconstructed tree species S14 is "pulled outside" the group formed by S13, S15, and S16. This is most likely because of the gene-flow that S14 received from S00. This is a typical pattern: when one species from within a group receives introgression from another group, it tends to be "pulled out" like this in phylogenetic reconstruction. One argument that could be made here is that the Neighbor Joining algorithm is outdated, and that perhaps newer, more sophisticated, methods would recover the correct tree. You can now try to apply SVDQuartets in PAUP*, and also try any of the other phylogenomic methods you know to see if any of these will succeed.

    Click here to see the reconstructed SVDQuartets tree with gene-flow

    As you can see, the topology is in fact different from the Neighbor Joining tree, but it is also not correct (S13 and S14 should not be sister taxa, and S10 and S11 are swapped).

    1.3 Testing for gene-flow in simulated data

    Under incomplete lineage sorting alone, two sister species are expected to share about the same proportion of derived alleles with a third closely related species. Thus, if species "P1" and "P2" are sisters and "P3" is a closely related species, then the number of derived alleles shared by P1 and P3 but not P2 and the number of derived alleles that is shared by P2 and P3 but not P1 should be approximately equal. In contrast, if hybridization leads to introgression between species P3 and one of the two species P1 and P2, then P3 should share more derived alleles with that species than it does with the other one, leading to asymmetry in the sharing of derived alleles. These expectations are the basis for the so-called "ABBA-BABA test" (first described in the Supporting Material of Green et al. 2010) that quantifies support for introgression by the D-statistic. Below is an illustration of the basic principle.

    In short, if there is gene-flow between P2 <-> P3, there is going to be an excess of the ABBA pattern, leading to positive D statistics. In contrast, gene-flow between P1 <-> P3 would lead to an excess of the BABA pattern and a negative D statistic. However, whether a species is assigned in the P1 or P2 position is arbitrary, so we can always assign them so that P2 and P3 share more derived alleles and the D statistic is then bounded between 0 and 1. There is also a related and somewhat more complicated measure, the f4-ratio, which strives to estimate the admixture proportion in percentage. We will not go into the maths here - if you are interested, have a look at the Dsuite paper.

    The Dsuite software has several advantages: it brings several related statistics together into one software package, has a straightforward workflow to calculate the D statistics and the f4-ratio for all combinations of trios in the dataset, and uses the standard VCF format, thus generally avoiding the need for format conversions or data duplication. It is computationally more efficient than other software in the core tasks, making it more practical for analysing large genome-wide data sets with tens or even hundreds of populations or species. Finally, Dsuite implements the calculation of the fdM and f-branch statistics for the first time in publicly available software.

    To calculate the D statistics and the f4-ratio for all combinations of trios of species, all we need is the file that specifies what individuals belong to which population/species - we prepared it for you: species_sets.txt. Such a file could be simply prepared manually, but, in this case we can save ourselves the work and automate the process using a combination of bcftools and awk :

    Something similar to the above can be useful in many cases, depending on how the individuals are named in your VCF file.
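    As a rough Python counterpart to that idea, the sketch below reads the sample IDs from the #CHROM header line of a VCF and writes one tab-separated "sample, species" pair per line, using the keyword Outgroup for the outgroup individuals. The sample naming convention (species label before an underscore, as in S03_1), the choice of S00 as outgroup, and the helper names are all assumptions made for illustration; adapt them to how the individuals are named in your own VCF.

```python
def samples_from_vcf_header(lines):
    """Return the sample IDs from the '#CHROM' header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The first nine columns are fixed VCF fields; samples follow.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def make_sets_lines(samples, outgroup_species):
    """Map each sample to its species, assuming IDs look like 'S03_1'."""
    out = []
    for sample in samples:
        species = sample.split("_")[0]  # hypothetical naming convention
        label = "Outgroup" if species == outgroup_species else species
        out.append(f"{sample}\t{label}")
    return out


# A tiny in-memory header stands in for a real VCF file here.
header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tS00_0\tS00_1\tS01_0\tS01_1\n",
]
sets_lines = make_sets_lines(samples_from_vcf_header(header), outgroup_species="S00")
print("\n".join(sets_lines))
```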

    Then, to familiarize yourself with Dsuite, simply start the program with the command Dsuite . When you hit enter, you should see a help text that informs you about three different commands named "Dtrios", "DtriosCombine", and "Dinvestigate", with short descriptions of what these commands do. Of the three commands, we are going to focus on Dtrios, which is the one that calculates the D-statistic for all possible species trios.

    To learn more about the command, type Dsuite Dtrios and hit enter. The help text should then inform you about how to run this command. There are numerous options, but the defaults are appropriate for a vast majority of use-cases. All we are going to do is to provide a run name using the -n option, the correct tree using the -t option, and use the -c option to indicate that this is the entire dataset and, therefore, we don't need intermediate files for "DtriosCombine".

    1.3.1 Do we find geneflow in data simulated without geneflow?

    We run Dsuite for the dataset without gene-flow as follows:

    The run takes about 50 minutes. Therefore, we already put the output files for you in the data folder. Let's have a look at the first few lines of species_sets_no_geneflow_BBAA.txt :

    Each row shows the results for the analysis of one trio. For example in the first row, species S01 was used as P1, S02 was considered as P2, and S00 was placed in the position of P3. Then we see the D statistic, associated Zscore and p-value, the f4-ratio estimating admixture proportion and then the counts of BBAA sites (where S01 and S02 share the derived allele) and then the counts of ABBA and BABA sites. As you can see, ABBA is always more than BABA and the D statistic is always positive because Dsuite orients P1 and P2 in this way. Since these results are for coalescent simulations without gene-flow, the ABBA and BABA sites arise purely through incomplete lineage sorting and the difference between them is purely random - therefore, even though the D statistic can be quite high (e.g. up to 8% on the last line), this is not a result of gene flow.

    Question 1: Can you tell why the BBAA, ABBA, and BABA numbers are not integer numbers but have decimal positions?

    Integer counts of ABBA and BABA sites would only be expected if each species were represented by a single haploid sequence. With diploid sequences and multiple samples per species, allele frequencies are taken into account to weigh the counts of ABBA and BABA sites as described by equations 1a to 1c of the Dsuite paper.
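    To make the frequency weighting concrete, here is a minimal sketch of how fractional pattern counts can arise. It assumes that the outgroup is fixed for the ancestral allele at every site and that each site is summarised by the derived-allele frequencies in P1, P2 and P3; the function names are made up for this illustration and this is not Dsuite's own code.

```python
def pattern_counts(sites):
    """Sum frequency-weighted BBAA, ABBA and BABA counts over sites.

    Each site is a tuple (p1, p2, p3) of derived-allele frequencies in
    P1, P2 and P3; the outgroup is assumed fixed for the ancestral allele.
    """
    bbaa = abba = baba = 0.0
    for p1, p2, p3 in sites:
        bbaa += p1 * p2 * (1 - p3)
        abba += (1 - p1) * p2 * p3
        baba += p1 * (1 - p2) * p3
    return bbaa, abba, baba


def d_statistic(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA); undefined when both are zero."""
    total = abba + baba
    return (abba - baba) / total if total > 0 else float("nan")


# Two diploid individuals per species give frequencies in steps of 0.25.
sites = [(0.0, 1.0, 1.0), (0.25, 0.75, 0.5), (1.0, 0.0, 0.5)]
bbaa, abba, baba = pattern_counts(sites)
print(bbaa, abba, baba, d_statistic(abba, baba))
```

    Only the first of the three example sites contributes a whole count; the other two contribute fractions, which is why the summed totals are not integers.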

    Question 2: How many different trios are listed in the file? Are these all possible (unordered) trios?

    Because each trio is listed on a single row, the number of trios in file species_sets_no_geneflow_BBAA.txt is identical to the number of lines in this file. This number can easily be counted using, e.g. the following command:

    You should see that the file includes 1140 lines and therefore results for 1140 trios. Given that the dataset includes (except the outgroup species) 20 species and 3 of these are required to form a trio, the number of possible trios is 20 choose 3 = (20 × 19 × 18) / (3 × 2 × 1) = 1140, so the file does list all possible unordered trios.
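    The trio count can also be checked programmatically. The following sketch, which assumes hypothetical species labels S00 to S19, enumerates all unordered trios with Python's standard library and compares the result with the binomial coefficient:

```python
from itertools import combinations
from math import comb

species = [f"S{i:02d}" for i in range(20)]  # hypothetical labels S00..S19
trios = list(combinations(species, 3))       # all unordered trios

print(len(trios), comb(20, 3))
```

    The same call with 10 species, comb(10, 3), gives the number of trios for a ten-species dataset.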

    In species_sets_no_geneflow_BBAA.txt , trios are arranged so that P1 and P2 always share the most derived alleles (BBAA is always the highest number). There are two other output files: one with the _tree.txt suffix: species_sets_no_geneflow_tree.txt where trios are arranged according to the tree we gave Dsuite, and a file with the _Dmin.txt suffix species_sets_no_geneflow_Dmin.txt where trios are arranged so that the D statistic is minimised - providing a kind of "lower bound" on gene-flow statistics. This can be useful in cases where the true relationships between species are not clear, as illustrated for example in this paper (Fig. 2a).
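    The difference between the BBAA and Dmin arrangements can be illustrated with a small sketch. For one unordered trio it takes the three pairwise sharing counts, builds the three possible arrangements (each oriented so that D is not negative), and then picks one by the highest BBAA count and one by the lowest D. The species names, the counts, and the function names are invented for illustration, and the logic is a simplified reading of the description above rather than Dsuite's implementation.

```python
def orient(pair, third, shared, with_first, with_second):
    """Order the pair so that P2 shares more derived alleles with P3 (D >= 0)."""
    first, second = pair
    if with_first > with_second:
        first, second = second, first
        with_first, with_second = with_second, with_first
    abba, baba = with_second, with_first
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return {"P1": first, "P2": second, "P3": third,
            "D": d, "BBAA": shared, "ABBA": abba, "BABA": baba}


def trio_arrangements(a, b, c, n_ab, n_ac, n_bc):
    """All three ways of choosing which two species occupy P1 and P2."""
    return [
        orient((a, b), c, n_ab, n_ac, n_bc),
        orient((a, c), b, n_ac, n_ab, n_bc),
        orient((b, c), a, n_bc, n_ab, n_ac),
    ]


# Hypothetical counts: A shares many derived alleles with both B and C.
options = trio_arrangements("A", "B", "C", n_ab=50, n_ac=45, n_bc=5)
bbaa_choice = max(options, key=lambda o: o["BBAA"])
dmin_choice = min(options, key=lambda o: o["D"])
print(bbaa_choice)
print(dmin_choice)
```

    With these counts the two criteria disagree: the BBAA arrangement places C in the P3 position, whereas the Dmin arrangement places A there.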

    Let's first see how the other outputs differ from the _tree.txt file, which has the correct trio arrangements:

    There is one difference in the _BBAA.txt file. Because incomplete lineage sorting is a stochastic (random) process, we see that S08 and S10 share more derived alleles than S09 and S10, which are sister species in the input tree. If you look again at the input tree, you will see that the branching order between S08, S09 and S10 is very rapid; it is almost a polytomy.

    A comparison of the _tree.txt file against the _Dmin.txt file, which minimises the D statistic, reveals nine differences. However, the correct trio arrangements in all these cases are very clear.

    Next, let's look at the results in more detail, for example in R. We load the _BBAA.txt file and first look at the distribution of D values:

    There are some very high D statistics. In fact, the D statistics for 9 trios are >0.7, which is extremely high. So how is this possible in a dataset simulated with no geneflow?

    These nine cases arise because there is almost no incomplete lineage sorting among these trios - almost all sites are BBAA - e.g. 179922 sites for the first trio, while the count for ABBA is only 1.5 and for BABA it is 0. The D statistic is calculated as D = (ABBA-BABA)/(ABBA+BABA), which for the first trio would be D = (1.5-0)/(1.5+0)=1. So, the lesson here is that the D statistic is very sensitive to random fluctuations when there is a small number of ABBA and BABA sites. One certainly cannot take the D value seriously unless it is supported by a statistical test suggesting that the D is significantly different from 0. In the most extreme cases above, the p-value could not even be calculated, because there were so few sites. Those definitely do not represent geneflow. But in one case we see a p value of 0.0018. Well, that looks significant, if one considers for example the traditional 0.05 cutoff. So, again, how is this possible in a dataset simulated with no geneflow?

    In fact, there are many p values that are <0.05. For those who have a good understanding of statistics, this will not be surprising. This is because p values are uniformly distributed when the null hypothesis is true. Therefore, we expect 5% (or 1 in 20) of the p-values to be <0.05. If we did 1140 tests, we can expect 57 of them to be <0.05. Therefore, any time we conduct a large number of statistical tests, we should apply a multiple testing correction - commonly used is the Benjamini-Hochberg (BH) correction, which controls the false discovery rate.
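    For readers who want to see what the BH correction does, below is a minimal pure-Python sketch of the standard step-up procedure. The example p-values are made up, and in practice one would normally rely on an existing implementation (for example p.adjust in R) rather than this illustration.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up example values
adjusted = benjamini_hochberg(pvals)
print(adjusted)
print(0.05 * 1140)  # expected number of p < 0.05 among 1140 true-null tests
```

    Note that two of the raw p-values below 0.05 in this toy example end up slightly above 0.05 after adjustment.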

    However, even after applying the BH correction there are three p-values which look significant. These are all false discoveries. Here comes an important scientific lesson - that even if we apply statistical tests correctly, seeing a p-value below 0.05 is not a proof that the null hypothesis is false. All hypothesis testing has false positives and false negatives. It may be helpful to focus less on statistical testing and aim for a more nuanced understanding of the dataset, as argued for example in this Nature commentary.

    Therefore, we should also plot the f4-ratio, which estimates the proportion of genome affected by geneflow. It turns out that this is perhaps the most reliable - all the f4-ratio values are tiny, as they should be for a dataset without geneflow.

    Finally, we use a heatmap visualisation in which the species in positions P2 and P3 are sorted on the horizontal and vertical axes, and the color of the corresponding heatmap cell indicates the most significant D-statistic found between these two species, across all possible species in P1. To prepare this plot, we need to prepare a file that lists the order in which the P2 and P3 species should be plotted along the heatmap axes. The file should look like plot_order.txt. You could prepare this file manually, or below is a programmatic way:

    Then make the plots using the scripts plot_d.rb and plot_f4ratio.rb .

    1.3.2 Do we find geneflow in data simulated with geneflow?

    How do the results for the simulation with geneflow differ from the above? Here we are going to run a similar set of analyses and make comparisons. We run Dsuite for the dataset with gene-flow as follows:

    Now we find 39 differences between the _tree.txt file and the _BBAA.txt file, reflecting that, under geneflow, sister species often do not share the most derived alleles. Between the _tree.txt file and the _Dmin.txt file there are 124 differences.

    We can visualise the overlap between these different files using a Venn diagram. For example in R:

    Then we explore the results in the _BBAA.txt in R, analogously to how we did it above for the no-geneflow case:

    As you can see if you click above, the distributions of the statistics are markedly different when compared against the no-geneflow scenario. The number of D statistics >0.7 here is 85 (compared with 9 under no-geneflow) and very many trios now have significant p values - even after FDR correction, we have p<0.05 for a whopping 671 trios. Finally, the f4-ratios are also elevated, up to almost 15% in some cases.

    To remind ourselves, the simulated tree and geneflow events are shown on the left. The 15% f4-ratio estimates correspond reasonably well with the strength of the geneflow events that we simulated (in the region of 8% to 18%). However, we simulated only five geneflow events and have 671 significant p values and 138 f4-ratio values above 3%. This is because the test statistics are correlated when trios share an (internal) branch in the overall population or species tree. Therefore, a system of all possible four taxon tests across a data set can be difficult to interpret. In any case (and with any methods) pinpointing specific introgression events in data sets with tens or hundreds of populations or species remains challenging - especially when genetic data is all the evidence we have.

    The scripts plot_d.rb and plot_f4ratio.rb were originally developed to help with such interpretation, and can still be useful. Because they take the maximum D or f4-ratio value between species in the P2 and P3 positions, across all possible species in P1, the plot deals with some of the correlated signals and redundancy in the data by focusing on the overall support for geneflow between pairs of species or their ancestors, which could have happened at any time in the past since the species in P2 and P3 positions diverged from each other.
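    The summarising step described above can be sketched in a few lines of Python. The example below keeps, for every pair of species in the P2 and P3 positions, the highest statistic found across all P1 species. Treating the pair as unordered, the tuple layout of the input rows, and the example values are assumptions made for this illustration; the Ruby scripts themselves may differ in detail.

```python
def max_stat_per_pair(rows):
    """Map each unordered (P2, P3) pair to its highest statistic across all P1."""
    best = {}
    for p1, p2, p3, value in rows:
        pair = frozenset((p2, p3))
        if value > best.get(pair, float("-inf")):
            best[pair] = value
    return best


# Made-up rows of (P1, P2, P3, statistic), e.g. D or the f4-ratio.
rows = [
    ("S01", "S02", "S00", 0.02),
    ("S03", "S02", "S00", 0.11),
    ("S01", "S03", "S00", 0.05),
]
best = max_stat_per_pair(rows)
print(best[frozenset(("S02", "S00"))])
```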

    Question 3: How informative are the plots above? Can you identify the gene flow events from the plots?

    As an upgrade on the above plots, we developed, together with Hannes Svardal, the f-branch or fb(C) metric (introduced in Malinsky et al. 2018). This is designed to disentangle correlated f4-ratio results and, unlike the matrix presentation above, f-branch can assign gene flow to specific, possibly internal, branches on a phylogeny. The f-branch metric builds upon and formalises verbal arguments employed by Martin et al. (2013), who used these lines of reasoning to assign gene flow to specific internal branches on the phylogeny of Heliconius butterflies.

    The logic of f-branch is illustrated in the following figure. Panel (c) provides an example illustrating interdependences between different f4-ratio scores, which can be informative about the timing of introgression. In this example, different choices for the P1 population provide constraints on when the gene flow could have happened. Panel (d): based on relationships between the f4-ratio results from different four taxon tests, the f-branch, or fb statistic, distinguishes between admixture at different time periods, assigning signals to different (possibly internal) branches in the population/species tree.

    This is implemented in the Dsuite Fbranch subcommand, and the plotting utility, called dtools.py is in the utils subfolder of the Dsuite package.

    The second command creates a file called fbranch.png , which is shown below.

    Question 4: Can you identify the gene flow events clearer here than from the matrix plots above? Is this a good showcase for the f-branch method?

    Question 5: If you exclude species with the strongest f4-ratio or f-branch signals, can you then get a correct phylogeny from PAUP*?

    Question 6: What happens when you re-run Dsuite with the inferred (wrong) tree from PAUP*?

    2. Finding specific introgressed loci - adaptive introgression in Malawi cichlids

    This exercise is based on analysis from the Malinsky et al. (2018) manuscript published in Nature Ecol. Evo. The paper shows that two deep water adapted lineages of cichlids share signatures of selection and very similar haplotypes in two green-sensitive opsin genes (RH2Aβ and RH2B). The genes are located next to each other on scaffold_18. To find out whether these shared signatures are the result of convergent evolution or of adaptive introgression, we used the f_dM statistic. The f_dM is related to Patterson’s D and to the f4-ratio, but is better suited to analyses of sliding genomic windows. The calculation of this statistic is implemented in the program Dsuite Dinvestigate.

    The data for this exercise are in the data folder. It includes the VCF file with variants mapping to the scaffold_18 of the Malawi cichlid reference genome we used at the time - scaffold_18.vcf.gz . There are also two other files required to run Dinvestigate: the “SETS” file and the “test_trios” file. In this case they are called: MalawiSetsFile.txt and MalawiTestTriosForInvestigate.txt . The “TestTrios” file specifies that we want to investigate the admixture signal between the Diplotaxodon genus and the deep benthic group, compared with the mbuna group. The “SETS” file specifies which individuals belong to these groups. Finally, the command to execute the analysis is:

    The -w 50,25 option specifies that the statistics should be averaged over windows of 50 informative SNPs, moving forward by 25 SNPs at each step. The run should take a little under 10 minutes. We suggest you have a tea/coffee break while you wait for the results.
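    To illustrate what a window of 50 SNPs with a step of 25 SNPs means, the sketch below slides such a window over per-SNP ABBA and BABA contributions and computes D in each window as a ratio of sums. The ratio-of-sums choice, the function name, and the toy input are assumptions made for this illustration; Dinvestigate's own calculation (and the f_dM statistic in particular) may be computed differently.

```python
def window_d(abba, baba, size=50, step=25):
    """Yield (start_index, D) for windows of `size` SNPs moved by `step`."""
    for start in range(0, len(abba) - size + 1, step):
        a = sum(abba[start:start + size])
        b = sum(baba[start:start + size])
        d = (a - b) / (a + b) if a + b > 0 else float("nan")
        yield start, d


# Toy data: 50 SNPs with only ABBA signal followed by 50 with only BABA.
abba = [1.0] * 50 + [0.0] * 50
baba = [0.0] * 50 + [1.0] * 50
windows = list(window_d(abba, baba, size=50, step=25))
print(windows)
```

    Because consecutive windows overlap by half, the middle window mixes both halves of the toy data, which is the kind of smoothing that makes larger windows look like a single broad peak.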

    Question 7: What are the overall D and f_dM values over the entire scaffold_18? What does this suggest?

    The results are output by Dsuite into the file mbuna_deep_Diplotaxodon_localFstats__50_25.txt . A little R plotting function plotInvestigateResults.R is prepared for you. Use the script to load in the file you just produced (line 3) and plot the D statistic (line 6). Also execute line 8 of the script to plot the f_dM values. Do you see any signal near the opsin coordinates? We also plot the f_d statistic. As you can see, the top end of the plot is the same as for the f_dM, but the f_d is asymmetrical, extending far further into negative values.

    Question 8: Do you see any interesting signal in the D, f_dM, and f_d statistics? The opsin genes are located between 4.3Mb and 4.4Mb. Do you see anything interesting there?

    Click here to see the resulting R plots

    You could also plot the new d_f statistic. Does that look any better?

    Finally, we zoom in at the region of the opsin genes (line 12). As you can see, the results look like a single “mountain” extending over 100kb.

    Click here to see the zoom in with `-w 50,25`

    But there is more structure than that in the region. Perhaps we need to reduce the window or step size to see a greater level of detail.

    They can be plotted with the same R script. Have a look at the results.

    Click here to see the zoom in with `-w 50,5`

    Click here to see the zoom in with `-w 50,1`

    Click here to see the zoom in with `-w 10,1`

    Click here to see the zoom in with `-w 2,1`

    Question 9: What combination of window size/step seems to have the best resolution? Why is the smallest window so noisy?

    Question 10: What happens if you plot individual data points, instead of a continuous line? Are the results clearer?

    Click here to see the zoom in with `-w 10,1` and individual data points

    3. Finding geneflow in a real dataset - Tanganyikan cichlids

    In this exercise, we are going to see if we can reproduce the findings reported by Gante et al. (2016), with a different dataset. The Gante et al. dataset contained whole genome sequence data from five species from the cichlid genus Neolamprologus. The authors analysed these data and reported conclusions that are summarised by the figure below:

    A dataset containing these species, but also six additional Neolamprologus species (for a total of 11) was used in the tutorials on Species-Tree Inference with SNP Data and Divergence-Time Estimation with SNP Data.

    Question 11: Are the trees you reconstructed in these exercises consistent with the relationships reported by Gante et al.?

    Here we use data with 10 Neolamprologus species (the clearly hybrid Neolamprologus cancellatus removed), to reassess the evidence for geneflow within this genus with the f4-ratio and f-branch statistics. The genetic data are in NC_031969.vcf.gz , the file specifying sample->species relationships is NC_031969_sets.txt and the tree topology hypothesis is in SNAPP_tree.txt . We run the analysis for all possible trios as follows:

    This should finish in a couple of minutes. There are 'only' ten species, so 120 trios. Could this be manageable? Have a look at the output file TanganyikaCichlids/NC_031969_sets__tree.txt and see if you can interpret the results. Chances are that it is still too complex to interpret the results for the trios just by looking at them. Perhaps you can try the ‘f-branch’ method:

    Question 12: Are the geneflow signals seen here consistent with the Gante et al. figure?

    Question 13: What happens when we focus only on the five species from Gante et al. and exclude all others?

    Notice the -n option to dtools.py , to specify the output file name, making sure that our previous plots are not overwritten. Below is the plot, after a little editing in Inkscape.

    A very simple alternative way of investigating patterns of ancestry in potentially introgressed or hybrid species is to "paint" their chromosomes according to the genotypes carried at sites that are fixed between the presumed parental species. This type of plot, termed "ancestry painting" was used for example by Fu et al. (2015 Fig. 2) to find blocks of Neanderthal ancestry in an ancient human genome, by Der Sarkassian et al. (2015 Fig. 4) to investigate the ancestry of Przewalski's horses, by Runemark et al. (2018 Suppl. Fig. 4) to assess hybridization in sparrows, and by Barth et al. (2019 Fig. 2) to identify hybrids in tropical eels.

    If you haven't seen any of the above-named studies, you might want to have a look at the ancestry-painting plots in some of them. You may note that the ancestry painting in Fu et al. (2015 Fig. 2) is slightly different from the other three studies because no discrimination is made between heterozygous and homozygous Neanderthal alleles. Each sample in Fig. 2 of Fu et al. (2015) is represented by a single row of cells that are white or colored depending on whether or not the Neanderthal allele is present at a site. In contrast, each sample in the ancestry paintings of Der Sarkassian et al. (2015 Fig. 4), Runemark et al. (2018 Suppl. Fig. 4), and Barth et al. (2019 Fig. 2) is drawn with two rows of cells. However, as the analyses in these studies were done with unphased data, these two rows do not represent the two haplotypes per sample. Instead, the two cells per site were simply both colored in the same way for homozygous sites or differently for heterozygous sites without regard to haplotype structure.

    Here, we are going to use ancestry painting to investigate ancestry in Neolamprologus cancellatus ("neocan"), assuming that it is a hybrid between the parental species Altolamprologus fasciatus ("altfas") and Telmatochromis vittatus ("telvit"). As in Der Sarkassian et al. (2015 Fig. 4), Runemark et al. (2018 Suppl. Fig. 4), and Barth et al. (2019 Fig. 2), we are going to draw two rows per sample to indicate whether genotypes are homozygous or heterozygous.

    To generate an ancestry painting, we will need the data file NC_031969.f5.sub1.vcf.gz and will run two Ruby scripts. The first of these, get_fixed_site_gts.rb determines the alleles of the putative hybrid species at sites that are fixed differently in the two putative parental species. The second script, plot_fixed_site_gts.rb then uses the output of the first script to draw an ancestry painting. As the first script requires an uncompressed VCF file as input, first uncompress the VCF file for the SNP dataset with the following command:

    Then, run the Ruby script get_fixed_site_gts.rb to determine the alleles at sites that are fixed differently in the two parents. This script expects six arguments; these are

    • the name of the uncompressed VCF input file, NC_031969.f5.sub1.vcf,
    • the name of an output file, which will be a tab-delimited table,
    • a string of comma-separated IDs of samples for the first putative parent species,
    • a string of comma-separated IDs of samples for the putative hybrid species,
    • another string of comma-separated IDs of samples for the second putative parent species,
    • a threshold value for the required completeness of parental genotype information so that sites with too much missing data are discarded.

    We'll use NC_031969.f5.sub1.vcf as the input and name the output file pops1.fixed.txt . Assuming that the parental species are Altolamprologus fasciatus ("altfas") and Telmatochromis vittatus ("telvit") and the hybrid species is Neolamprologus cancellatus ("neocan"), we'll specify the sample IDs for these species with the strings "AUE7,AXD5", "JBD5,JBD6", and "LJC9,LJD1". Finally, we'll filter for sites without missing data by specifying "1.0" as the sixth argument. Thus, run the script get_fixed_site_gts.rb with the following command:

    The second script, plot_fixed_site_gts.rb , expects four arguments, which are

    • the name of the file written by script get_fixed_site_gts.rb ,
    • the name of an output file which will be a plot in SVG format,
    • a threshold value for the required completeness, which now applies not only to the parental species but also to the putative hybrid species,
    • the minimum chromosomal distance in bp between SNPs included in the plot. This last argument aims to avoid that the ancestry painting is overly dominated by high-divergence regions.
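    The minimum-distance thinning mentioned in the last item of the list can be pictured with a short sketch. The greedy left-to-right strategy, the function name, and the example positions are assumptions made for illustration; plot_fixed_site_gts.rb may select sites differently.

```python
def thin_sites(positions, min_dist=1000):
    """Keep sites greedily so that kept neighbours are >= min_dist bp apart."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_dist:
            kept.append(pos)
    return kept


# Made-up SNP positions in bp along one chromosome.
thinned = thin_sites([100, 600, 1100, 1150, 2500, 3400], min_dist=1000)
print(thinned)
```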

    We'll use the file pops1.fixed.txt as input, name the output file pops1.fixed.svg, require again that no missing data remains in the output, and we'll thin the remaining sites so that those plotted have a minimum distance of 1,000 bp to each other. Thus, use the following command to draw the ancestry painting:

    The screen output of this script will include some warnings about unexpected genotypes; these can be safely ignored as the script automatically excludes those sites. At the very end, the output should indicate that 6,069 sites with the required completeness were found; these are the sites included in the ancestry painting. The output also reports, for all analyzed specimens, the heterozygosity at those 6,069 sites. For first-generation hybrids, this heterozygosity is expected to be close to 1.
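    The reasoning behind the heterozygosity value can be sketched as follows. The example keeps only sites at which the two parental species are fixed for different alleles and then reports the fraction of those sites at which a putative hybrid carries one allele from each parent. The data layout, the function names, and the toy genotypes are hypothetical, missing data is ignored, and the Ruby scripts may implement these steps differently.

```python
def fixed_allele(genotypes):
    """Return the allele shared by all parental genotypes, or None if not fixed."""
    alleles = {allele for gt in genotypes for allele in gt}
    return alleles.pop() if len(alleles) == 1 else None


def hybrid_heterozygosity(sites):
    """Fraction of differentially fixed sites at which the hybrid is heterozygous.

    Each site is (parent1_gts, hybrid_gt, parent2_gts), with genotypes given
    as allele pairs such as ("A", "A"); missing data is not handled here.
    """
    informative = heterozygous = 0
    for parent1, hybrid, parent2 in sites:
        a1, a2 = fixed_allele(parent1), fixed_allele(parent2)
        if a1 is None or a2 is None or a1 == a2:
            continue  # not fixed for different alleles in the two parents
        informative += 1
        if set(hybrid) == {a1, a2}:
            heterozygous += 1
    return heterozygous / informative if informative else float("nan")


sites = [
    ([("A", "A"), ("A", "A")], ("A", "G"), [("G", "G"), ("G", "G")]),  # het
    ([("C", "C"), ("C", "C")], ("C", "C"), [("T", "T"), ("T", "T")]),  # hom
    ([("A", "A"), ("A", "G")], ("A", "G"), [("G", "G"), ("G", "G")]),  # skipped
    ([("T", "T"), ("T", "T")], ("T", "T"), [("T", "T"), ("T", "T")]),  # skipped
]
het = hybrid_heterozygosity(sites)
print(het)
```

    A first-generation hybrid inherits one chromosome copy from each parental species, so at sites of this kind it should be heterozygous almost everywhere, which is why a value close to 1 is expected.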

    Open the file pops1.fixed.svg with a program capable of reading files in SVG format, for example with a browser such as Firefox or with Adobe Illustrator. You should see a plot like the one shown below.

    In this ancestry painting, the two samples of the two parental species are each drawn in solid colors because all included sites were required to be completely fixed and completely without missing data. The samples of Neolamprologus cancellatus, "LJC9" and "LJD1", are drawn in between, with two rows per sample that are colored according to genotypes observed at the 6,069 sites. Keep in mind that even though the pattern may appear to show phased haplotypes, this is not the case; instead, the bottom row for a sample is arbitrarily colored in red and the top row is colored in blue when the genotype is heterozygous.

    Question 14: Do you notice any surprising difference to the ancestry plots of Der Sarkassian et al. (2015 Fig. 4) and Runemark et al. (2018 Suppl. Fig. 4)?

    One remarkable difference compared to the ancestry painting of Der Sarkassian et al. (2015 Fig. 4) and Runemark et al. (2018 Suppl. Fig. 4) is that almost no homozygous genotypes are observed in the two samples of Neolamprologus cancellatus: the bottom rows are drawn almost entirely in red for the two putative hybrid individuals and the top rows are almost entirely in blue. The same pattern, however, can be found in Barth et al. (2019 Fig. 2).

    Question 15: How can this difference be explained?

    The fact that both Neolamprologus cancellatus samples are heterozygous for basically all sites that are differentially fixed in the two parental species can only be explained if both of these samples are in fact first-generation hybrids. If introgression were instead more ancient and backcrossing (or recombination within the hybrid population) had occurred, we would expect only certain regions of the chromosome to be heterozygous, while others would be homozygous for the alleles of one or the other of the two parental species. However, unlike in cases where the genomes of parent-offspring trios have been sequenced, we do not know who the actual parent individuals were. We can guess that the parent individuals were members of the species Altolamprologus fasciatus and Telmatochromis vittatus, but whether they were part of the same population as the sampled individuals or of a more distantly related population within these species remains uncertain.


    Programming Praxis

    We represent a point as a three-slot vector with the point number (in the range 0 to n-1) in slot 0, the x-coordinate in slot 1, and the y-coordinate in slot 2. Here are convenience functions:

    (define (n p) (vector-ref p 0))
    (define (x p) (vector-ref p 1))
    (define (y p) (vector-ref p 2))

    A traveling salesman problem is a list of points. We make a problem with make-tsp:

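    (define (make-tsp n)
      (define n10 (* n 10))
      ; #t if some point in ps already sits at coordinates (px, py)
      (define (taken? px py ps)
        (and (pair? ps)
             (or (and (= px (x (car ps))) (= py (y (car ps))))
                 (taken? px py (cdr ps)))))
      ; comparing whole vectors with member never matches, because each
      ; point carries a unique number, so compare coordinates instead
      (let loop ((n (- n 1)) (ps '()))
        (if (negative? n) ps
            (let ((px (randint n10)) (py (randint n10)))
              (if (taken? px py ps) (loop n ps)
                  (loop (- n 1) (cons (vector n px py) ps)))))))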

    We compute distances as they are used, caching them in a global variable dists, which is a two-dimensional matrix initialized in the main solving program. A negative entry marks a distance that has not been computed yet. We store each distance twice, once in each direction, because the extra space costs nothing (the entire matrix is allocated anyway) and it saves us the trouble of figuring out a canonical direction:

    (define (dist a b)
      (define (square x) (* x x))
      (when (negative? (matrix-ref dists (n a) (n b)))
        (let ((d (sqrt (+ (square (- (x a) (x b)))
                          (square (- (y a) (y b)))))))
          (matrix-set! dists (n a) (n b) d)
          (matrix-set! dists (n b) (n a) d)))
      (matrix-ref dists (n a) (n b)))
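
    The matrix operations used by dist are not defined in this section. A minimal sketch of what they might look like, assuming a matrix is stored as a vector of row vectors and that make-matrix always receives a fill value (the definitions actually used by the solution may differ):

    ```scheme
    ; Assumed representation: a vector of rows, each row a separate vector.
    (define (make-matrix rows columns value)
      (let ((m (make-vector rows)))
        (do ((i 0 (+ i 1)))
            ((= i rows) m)
          ; allocate a fresh row each time so that rows do not share storage
          (vector-set! m i (make-vector columns value)))))

    (define (matrix-ref m i j)
      (vector-ref (vector-ref m i) j))

    (define (matrix-set! m i j v)
      (vector-set! (vector-ref m i) j v))
    ```

    With these definitions, (make-matrix len len -1) yields a matrix in which every entry is negative, which is exactly the "not yet computed" marker that dist tests for.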

    Given a point p and a list of unvisited points ps, function nearest finds the nearest unvisited point:

    (define (nearest p ps)
      (let loop ((ps ps) (min-p #f) (min-d #f))
        (cond ((null? ps) min-p)
              ((or (not min-d) (< (dist p (car ps)) min-d))
                (loop (cdr ps) (car ps) (dist p (car ps))))
              (else (loop (cdr ps) min-p min-d)))))

    We are ready for the solver. Tsp initializes the dists matrix, then enters a loop in which it tracks both the current tour and the list of unvisited points. At each step of the loop, it calculates the nearest neighbor of the most recently visited point, adds that point to the current tour, removes it from the list of unvisited points, and loops, stopping when the tour is complete. The initial point is always the first point in the input:

    (define (tsp ps)
      (let ((len (length ps)))
        (set! dists (make-matrix len len -1)))
      (let loop ((tour (list (car ps))) (unvisited (cdr ps)))
        (if (null? unvisited) tour
            (let ((next (nearest (car tour) unvisited)))
              (loop (cons next tour) (remove next unvisited))))))
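
    Tsp also relies on remove, which is not defined in this section. A minimal sketch, assuming it takes an item and a list and returns a new list without any element that is equal? to the item. Note that SRFI 1 provides a remove that takes a predicate instead, so the two are not interchangeable:

    ```scheme
    ; Assumed behavior: drop every element equal? to x, keeping the
    ; remaining elements in their original order.
    (define (remove x xs)
      (let loop ((xs xs) (kept '()))
        (cond ((null? xs) (reverse kept))
              ((equal? x (car xs)) (loop (cdr xs) kept))
              (else (loop (cdr xs) (cons (car xs) kept))))))
    ```

    Because points are vectors and equal? compares vectors element by element, this removes exactly the point that nearest returned.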

    Here we compute the cost of the tour, including the closing leg from the last point back to the first:

    (define (cost tour)
      (if (or (null? tour) (null? (cdr tour))) 0
          (let ((start (car tour)))
            (let loop ((tour tour) (sum 0))
              (if (null? (cdr tour))
                  (+ sum (dist (car tour) start))
                  (loop (cdr tour) (+ sum (dist (car tour) (cadr tour)))))))))
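
    As a quick sanity check of the cost calculation, the stand-alone sketch below applies the same loop to the corners of a 3-by-4 rectangle. To stay self-contained it uses hypothetical names (plain-dist, tour-cost) and represents points as plain (x . y) pairs with no distance cache, so it illustrates the arithmetic and is not a drop-in replacement for cost:

    ```scheme
    ; Hypothetical helper: uncached Euclidean distance between (x . y) pairs.
    (define (plain-dist a b)
      (define (square z) (* z z))
      (sqrt (+ (square (- (car a) (car b)))
               (square (- (cdr a) (cdr b))))))

    ; Same structure as cost: sum consecutive legs, then add the leg
    ; from the last point back to the start.
    (define (tour-cost tour)
      (if (or (null? tour) (null? (cdr tour))) 0
          (let ((start (car tour)))
            (let loop ((tour tour) (sum 0))
              (if (null? (cdr tour))
                  (+ sum (plain-dist (car tour) start))
                  (loop (cdr tour)
                        (+ sum (plain-dist (car tour) (cadr tour)))))))))

    ; Corners visited in order around the perimeter: legs 3, 4, 3, 4.
    (define around '((0 . 0) (3 . 0) (3 . 4) (0 . 4)))
    ; Same corners, but crossing both diagonals: legs 5, 4, 5, 4.
    (define crossed '((0 . 0) (3 . 4) (3 . 0) (0 . 4)))

    (display (tour-cost around))
    (newline)
    (display (tour-cost crossed))
    (newline)
    ```

    The first tour has cost 14 and the second has cost 18 (some implementations print these as 14.0 and 18.0), showing that the visiting order changes the cost even though the set of points is the same.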